{
"cells": [
{
"cell_type": "markdown",
"id": "a29bd84f-dbd8-49b9-92cd-f9f2b9c63a64",
"metadata": {
"id": "d6c54d8e"
},
"source": [
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "8c36f26c-4f74-4eca-a0e0-e812dffb4787",
"metadata": {
"id": "d6c54d8e"
},
"source": [
"# Case Study in NLP\n",
"\n",
"Copyright, NLP from scratch, 2024.\n",
"\n",
"[NLPfor.me](https://www.nlpfor.me)\n",
"\n",
"------------"
]
},
{
"cell_type": "markdown",
"id": "a89bc555-4c5c-49c2-bee1-e9edea062627",
"metadata": {},
"source": [
"In this notebook, we will work through a case study in NLP. You are working with a client that runs a website for reviews of products and services, and the product manager on the business side has come to you with the following request:\n",
"\n",
"> As we continue to receive an increasing volume of customer feedback on our website, it has become evident that manual categorization of reviews is not only time-consuming but also prone to errors. To improve our efficiency, enhance the overall user experience, and better utilize customer insights, leadership is proposing we develop a machine learning model to automatically categorize reviews into three primary categories: retail, restaurants, and movies, as these make up our largest categories of reviews and searches."
]
},
{
"cell_type": "markdown",
"id": "6aa8a109-4a0c-4808-87f3-162a4daf0434",
"metadata": {},
"source": [
"We will work through developing an MVP for the above business problem. Let's get started!"
]
},
{
"cell_type": "markdown",
"id": "cb5d20f3",
"metadata": {
"id": "d2109366"
},
"source": [
"## Data Loading and Exploration\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "54998616-ac84-4621-9176-f060bf9adb6c",
"metadata": {},
"source": [
"First we will import the \"holy trinity\" of data science in Python: [numpy](https://numpy.org) for working with numeric data, [pandas](https://pandas.pydata.org/) for working with structured data, and [matplotlib](https://matplotlib.org) for data visualization.\n",
"\n",
"For processing text data and doing machine learning (with [scikit-learn](https://scikit-learn.org)), we will import the relevant modules and classes as needed."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "13e19005-ca24-4efd-8b2a-e5851a7f5973",
"metadata": {},
"outputs": [],
"source": [
"# Holy trinity\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"id": "e7be2dfd-1695-436f-a6f6-51b903cfa925",
"metadata": {},
"source": [
"Next we will read in the data we will be working with. To build an MVP model, we will use a dataset that combines reviews from Amazon.com (for retail products / electronics), Rotten Tomatoes (for movies), and Yelp (for restaurants). This data is available in the [NLP from scratch datasets GitHub repo](https://github.com/nlpfromscratch/datasets/tree/master/amazon_rt_yelp).\n",
"\n",
"We can read the data in directly with `pd.read_csv`, as it can retrieve files directly from a URL, so there is no need to download the file first:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a9352e67-9938-472c-b227-933d7f7851bb",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/nlpfromscratch/datasets/refs/heads/master/amazon_rt_yelp/amazon_rt_yelp.csv')"
]
},
{
"cell_type": "markdown",
"id": "ea613404-ef62-4e66-9b2d-52d069cdc422",
"metadata": {},
"source": [
"Let's take a look at what we are working with:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8d98409c-a430-4646-92cb-b0cecce2e6e2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" text source\n",
"0 I picked up a jar of fresh-made salsa and chip... yelp\n",
"1 The husband and I had driven by Hula's multipl... yelp\n",
"2 Had some amazing cuisine at Milagro's. The che... yelp\n",
"3 Chill coffee bar. That is the best way to desc... yelp\n",
"4 Fancy shop with great kitchen items that I wo... yelp"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "c1497c38-a377-4609-b21c-09f3e5e55130",
"metadata": {},
"source": [
"We can see we have two columns: `text`, which holds the free-form review text, and `source`, which appears to hold a text label for where each review came from."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8510715d-2f8c-40a1-a7b1-716b607281e2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(15000, 2)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"id": "260a425c-0643-4fac-985f-265bcd0ef6ed",
"metadata": {},
"source": [
"There are 15,000 reviews in the dataset. What are the distinct values in the `source` column?"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c40491a5-d872-41ef-8784-109c684b9fb4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['yelp', 'rottentomatoes', 'amazon'], dtype=object)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['source'].unique()"
]
},
{
"cell_type": "markdown",
"id": "af598cda-452a-4890-91b2-21fa9c9b7588",
"metadata": {},
"source": [
"Let's dive deeper here and check if the distribution of different review types is uniform:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e5f74da6-7e1b-4755-8777-21682090d0a9",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"[Figure: horizontal bar chart, 'Count of reviews by source']"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure()\n",
"df['source'].value_counts().plot(kind='barh')\n",
"plt.title('Count of reviews by source')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "0f45e2fc-c127-4bf4-879a-7bec53d4972f",
"metadata": {},
"source": [
"It is: we have 5,000 reviews for each of the three review sources (amazon, rottentomatoes, and yelp). There is not much else to do in terms of exploratory data analysis, as we only have a text column and a categorical column, but let's take a look at the lengths of the reviews and their distribution:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4731e742-3036-4f91-b1b4-f2888f570c7a",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"[Figure: histogram, 'Distribution of Reviews by Length in Characters']"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10,4))\n",
"df['text'].str.len().hist(bins=500, grid=False)\n",
"plt.xticks(np.arange(0, 10000, 500))\n",
"plt.title('Distribution of Reviews by Length in Characters')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "6aabb9ca-b74a-448f-b4c6-e8303036d300",
"metadata": {},
"source": [
"We can see the review lengths are not normally distributed but follow a very long-tailed distribution, with the vast majority of reviews being between 0 and 500 characters in length."
]
},
{
"cell_type": "markdown",
"id": "c2e51144-2c5c-43b7-a692-2bde70ca22c8",
"metadata": {},
"source": [
"## Data preprocessing and transformation"
]
},
{
"cell_type": "markdown",
"id": "3dbb0bef-b8b6-4942-a438-be0785c83954",
"metadata": {},
"source": [
"Now that we have taken a look at the data we are working with, we will clean and preprocess the data in order to apply a machine learning model to predict the data source (review category).\n",
"\n",
"Let's take a look at the `text` column:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b833638e-0e35-4595-ba61-f4029c45e2b8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 I picked up a jar of fresh-made salsa and chip...\n",
"1 The husband and I had driven by Hula's multipl...\n",
"2 Had some amazing cuisine at Milagro's. The che...\n",
"3 Chill coffee bar. That is the best way to desc...\n",
"4 Fancy shop with great kitchen items that I wo...\n",
" ... \n",
"14995 This is a great tablet for the price. Amazon i...\n",
"14996 This tablet is the perfect size and so easy to...\n",
"14997 Purchased this for my son. Has room to upgrade...\n",
"14998 I had some thoughts about getting this for a 5...\n",
"14999 this is a steal, have 8 gb model as well.This ...\n",
"Name: text, Length: 15000, dtype: object"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['text']"
]
},
{
"cell_type": "markdown",
"id": "df0c9f40-e0a4-4ca2-a2d9-a481f7306f50",
"metadata": {},
"source": [
"For the first step in preprocessing we will remove capitals by converting all the reviews to entirely lowercase:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "84599312-df2f-4983-91c9-2cdfa093d334",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 i picked up a jar of fresh-made salsa and chip...\n",
"1 the husband and i had driven by hula's multipl...\n",
"2 had some amazing cuisine at milagro's. the che...\n",
"3 chill coffee bar. that is the best way to desc...\n",
"4 fancy shop with great kitchen items that i wo...\n",
" ... \n",
"14995 this is a great tablet for the price. amazon i...\n",
"14996 this tablet is the perfect size and so easy to...\n",
"14997 purchased this for my son. has room to upgrade...\n",
"14998 i had some thoughts about getting this for a 5...\n",
"14999 this is a steal, have 8 gb model as well.this ...\n",
"Name: text, Length: 15000, dtype: object"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1. Remove capitals\n",
"df['text'] = df['text'].str.lower()\n",
"df['text']"
]
},
{
"cell_type": "markdown",
"id": "06f1570c-b880-4e4a-97d5-e35d856db61a",
"metadata": {},
"source": [
"Next we remove punctuation:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1fe8523a-2307-409a-9b79-9406dcbff83b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"!\t\"\t#\t$\t%\t&\t'\t(\t)\t*\t+\t,\t-\t.\t/\t:\t;\t<\t=\t>\t?\t@\t[\t\\\t]\t^\t_\t`\t{\t|\t}\t~\t"
]
}
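],
"source": [
"# 2. Remove punctuation\n",
"import string\n",
"\n",
"# Strip each punctuation mark as a literal character (regex=False), printing the marks as we go\n",
"for mark in string.punctuation:\n",
"    print(mark, end='\\t')\n",
"    df['text'] = df['text'].str.replace(mark, '', regex=False)"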
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "9b18627f-1d06-42b0-8890-30ddc9b81678",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 i picked up a jar of freshmade salsa and chips...\n",
"1 the husband and i had driven by hulas multiple...\n",
"2 had some amazing cuisine at milagros the chef ...\n",
"3 chill coffee bar that is the best way to descr...\n",
"4 fancy shop with great kitchen items that i wo...\n",
" ... \n",
"14995 this is a great tablet for the price amazon is...\n",
"14996 this tablet is the perfect size and so easy to...\n",
"14997 purchased this for my son has room to upgrade ...\n",
"14998 i had some thoughts about getting this for a 5...\n",
"14999 this is a steal have 8 gb model as wellthis ha...\n",
"Name: text, Length: 15000, dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check\n",
"df['text']"
]
},
{
"cell_type": "markdown",
"id": "54fd3ec4-8c47-41dc-b7f8-ab864584c91e",
"metadata": {},
"source": [
"Let's check if there are any other special characters besides punctuation. Are there newline characters such as `\\n` present?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "71324fe1-4397-4557-85d4-db71c8f1d5f1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 i picked up a jar of freshmade salsa and chips...\n",
"4 fancy shop with great kitchen items that i wo...\n",
"5 uhoh where am i \\n\\nthe view from atop the mou...\n",
"7 having been to dave busters in california i w...\n",
"10 my husband and i went tonight to teakwoods for...\n",
" ... \n",
"4983 i dont frequent tempe too often but when i do ...\n",
"4986 this would most certainly be my coffee shop if...\n",
"4988 i came here on saturday becuase i had to get m...\n",
"4990 oh orange table what can i say about you i am ...\n",
"4996 where to start\\n\\nthe owners are very kind say...\n",
"Name: text, Length: 2788, dtype: object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['text'][df['text'].str.contains('\\n')]"
]
},
{
"cell_type": "markdown",
"id": "db8c84f4-8ee7-45a5-bf4d-a79704eced6b",
"metadata": {},
"source": [
"There are. Let's replace all runs of whitespace characters, such as newlines and tabs, with a single space using a regular expression:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "6eb15b1c-f531-4cb1-bfed-876fe97daa85",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 i picked up a jar of freshmade salsa and chips...\n",
"1 the husband and i had driven by hulas multiple...\n",
"2 had some amazing cuisine at milagros the chef ...\n",
"3 chill coffee bar that is the best way to descr...\n",
"4 fancy shop with great kitchen items that i wou...\n",
" ... \n",
"14995 this is a great tablet for the price amazon is...\n",
"14996 this tablet is the perfect size and so easy to...\n",
"14997 purchased this for my son has room to upgrade ...\n",
"14998 i had some thoughts about getting this for a 5...\n",
"14999 this is a steal have 8 gb model as wellthis ha...\n",
"Name: text, Length: 15000, dtype: object"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
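],
"source": [
"# Collapse runs of whitespace (spaces, tabs, newlines, carriage returns) into a single space;\n",
"# \\s already matches \\t, \\n and \\r, so a single character class is enough\n",
"df['text'] = df['text'].str.replace(r\"\\s+\", \" \", regex=True)\n",
"df['text']"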
]
},
{
"cell_type": "markdown",
"id": "365bd624-89c4-45e7-b2a0-ce9a75713126",
"metadata": {},
"source": [
"Great, that is the extent of the normalization we will do in preprocessing. We will not do stemming, and we will not remove stopwords here, as the latter is handled in the tokenization and vectorization step.\n",
"\n",
"Let's wrap all the preprocessing in a convenient function, as we may need to repeat this process later:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "43cda8e0-86ab-45ce-8a52-711f169b63f5",
"metadata": {},
"outputs": [],
"source": [
"# Wrap all the preprocessing in a function\n",
"def preprocess_text(text_column: pd.Series):\n",
"    '''\n",
"    This function takes a pandas series, text_column as input.\n",
"    '''\n",
"\n",
"    # Copy the column so the input is never modified\n",
"    output_column = text_column.copy()\n",
"\n",
"    # Make lower case\n",
"    output_column = output_column.str.lower()\n",
"\n",
"    # Remove punctuation (each mark is matched literally, not as a regex)\n",
"    for mark in string.punctuation:\n",
"        output_column = output_column.str.replace(mark, '', regex=False)\n",
"\n",
"    # Collapse runs of whitespace (including \\t, \\n, \\r) into a single space\n",
"    output_column = output_column.str.replace(r\"\\s+\", \" \", regex=True)\n",
"\n",
"    # Return the updated text column\n",
"    return output_column"
]
},
{
"cell_type": "markdown",
"id": "1d659de1-4751-4de6-8cd5-2b7f4782a255",
"metadata": {},
"source": [
"We've used a [docstring](https://peps.python.org/pep-0257/) in our function, so we can get usage information using `help`:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "f4d2b6d2-20c2-4dd3-9e51-8eb463ee2cc4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Help on function preprocess_text in module __main__:\n",
"\n",
"preprocess_text(text_column: pandas.core.series.Series)\n",
" This function takes a pandas series, text_column as input.\n",
"\n"
]
}
],
"source": [
"help(preprocess_text)"
]
},
{
"cell_type": "markdown",
"id": "90fecf88-a553-4c01-b093-e954972d1e38",
"metadata": {},
"source": [
"We should probably also quickly unit test our function to make sure it works properly. Let's give it 3 reviews with capitalization, punctuation, and special characters and see if they are addressed successfully:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "d593fbf8-4c23-4bf6-a5f1-839b2abb4088",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 this movie was awful\n",
"1 wasnt that bad\n",
"2 never watching a movie from this director again\n",
"dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Unit test?\n",
"movie_reviews = [\"This movie was AWFUL.\", \"Wasn't that bad!!!!\", \"Never watching a movie \\n\\n from this director \\t again\"]\n",
"movie_reviews = pd.Series(movie_reviews)\n",
"\n",
"preprocess_text(movie_reviews)"
]
},
{
"cell_type": "markdown",
"id": "6c9029cb-b5ea-46d4-ad77-3019362c708a",
"metadata": {},
"source": [
"We should also check that there are no side effects and that the original Series is unmodified by the function:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "9fa8185b-61e1-4325-aa34-b1d1baf35d7f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 This movie was AWFUL.\n",
"1 Wasn't that bad!!!!\n",
"2 Never watching a movie \\n\\n from this director...\n",
"dtype: object"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check that original data is unmodified\n",
"movie_reviews"
]
},
{
"cell_type": "markdown",
"id": "d9d8012b-8560-4625-b51f-5d3b88ee5fd4",
"metadata": {},
"source": [
"We can see that it is unmodified, so we can be reasonably confident that our function does what it needs to without side effects."
]
},
{
"cell_type": "markdown",
"id": "92f37f08-44e3-4e30-8164-35f69b4d718c",
"metadata": {},
"source": [
"### Tokenization and Vectorization"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "84b43056-440c-4c4a-91e7-3ca7c5679a54",
"metadata": {},
"source": [
"In this section we will perform tokenization and vectorization, converting the unstructured text of the reviews into numeric data that is suitable for analytics and/or machine learning. Both steps can be done together using the `CountVectorizer` from scikit-learn, which tokenizes, vectorizes, and removes stopwords from the preprocessed text:\n",
"\n",
"https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n",
"\n",
"To use the `CountVectorizer`, we import it from the `sklearn.feature_extraction.text` submodule, instantiate it, and then fit it to and transform the reviews:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "2de4bce6-fb38-4533-ad42-e50ef77e87ed",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# Instantiate\n",
"cv = CountVectorizer(stop_words='english')\n",
"\n",
"# Tokenize, vectorize, and remove stopwords\n",
"# DOCUMENT-TERM MATRIX\n",
"dtm = cv.fit_transform(df['text'])"
]
},
{
"cell_type": "markdown",
"id": "f2bb6082-a41c-4e07-9fa4-152422a7366a",
"metadata": {},
"source": [
"Let's check what is returned:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "77501b2b-efef-4d61-ac30-c14542610c0d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<15000x33715 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 365222 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dtm"
]
},
{
"cell_type": "markdown",
"id": "92fdc844-5f42-4fa9-a439-8cf65aafb8f6",
"metadata": {},
"source": [
"Our [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) is returned as a sparse matrix, because it contains a very large number of zeros: most words occur in only a small number of reviews. Let's put it into a pandas DataFrame to make this easier to see:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "29924e42-856b-4159-ac4b-ba05c3ff62df",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 00 007 01 01042012 03342 039 050 06 07092008 075 ... äúshow \\\n",
"0 0 0 0 0 0 0 0 0 0 0 ... 0 \n",
"1 0 0 0 0 0 0 0 0 0 0 ... 0 \n",
"2 0 0 0 0 0 0 0 0 0 0 ... 0 \n",
"3 0 0 0 0 0 0 0 0 0 0 ... 0 \n",
"4 0 0 0 0 0 0 0 0 0 0 ... 0 \n",
"\n",
" äúskills äústar äúthings école ém ótimo ôºå única único \n",
"0 0 0 0 0 0 0 0 0 0 \n",
"1 0 0 0 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 0 0 0 0 \n",
"\n",
"[5 rows x 33715 columns]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
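],
"source": [
"# Put in a pandas DataFrame for easier viewing\n",
"# Note: toarray() builds a dense 15000 x 33715 array, which needs roughly 4 GB of memory at 8 bytes per count\n",
"dtm_df = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())\n",
"dtm_df.head()"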
]
},
{
"cell_type": "markdown",
"id": "009165aa-5129-4c24-ae29-b7720f5b071f",
"metadata": {},
"source": [
"We can see that each column holds the count of one token, which serves as a feature. However, the matrix appears to be mostly zeros. How many elements are in the matrix?"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "6050eca8-7079-4e65-9a29-f18902485306",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"505,725,000\n"
]
}
],
"source": [
"print(f'{dtm_df.shape[0]*dtm_df.shape[1]:,}')"
]
},
{
"cell_type": "markdown",
"id": "ed181780-a06f-46b3-952d-59958314caaa",
"metadata": {},
"source": [
"There are a whopping ~505M elements in the matrix! But how many of these are non-zero? The sparse representation above reports `365,222` stored (non-zero) elements, which we use to calculate the fraction of zeros:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "1b835356-4d93-4916-8db9-ad1208f53ecf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9992778249048396"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
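],
"source": [
"# Fraction of zero entries: 1 - (stored non-zero elements / total elements)\n",
"1 - dtm.nnz / (dtm.shape[0] * dtm.shape[1])"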
]
},
{
"cell_type": "markdown",
"id": "222fbe19-6779-457f-962a-b756a9ef45f7",
"metadata": {},
"source": [
"The matrix is ~99.93% zeros! We can see the \"long tail of language\" if we look at the frequency of total occurrences of tokens in the data in a histogram:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "ddec4c86-1d9c-42a8-bfda-9bf5f8ad00c5",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"[Figure: histogram with log-scaled y-axis, 'Distribution of Token Occurrence in Dataset']"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Long tail of language: total occurrences of each token per 100 reviews\n",
"plt.figure(figsize=(10, 4))\n",
"(dtm_df.sum() / len(dtm_df) * 100.0).hist(bins=100)\n",
"plt.yscale('log')\n",
"plt.title('Distribution of Token Occurrence in Dataset')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "beb5a473-31a1-4956-8fae-235b53d9841f",
"metadata": {},
"source": [
"We have had to resort to a log scale for the y-axis, since the distribution of token counts is so extremely skewed, roughly following a power law. A cumulative version of this histogram could help inform our choice of `min_df` cutoff for the `CountVectorizer` if we wanted to reduce memory usage (keeping in mind that `min_df` counts the documents a token appears in, not its total occurrences).\n",
"\n",
"As a result of doing the count vectorization, we get \"free\" text analytics, as we can calculate the total number of occurrences of each token by summing each column over all rows (as we did above).\n",
"\n",
"What are the 10 most frequently occurring tokens in the dataset?"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "470c297b-ec74-4d50-9d6f-bb94a6ed027a",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAj8AAAGzCAYAAADANnYJAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABMcElEQVR4nO3deVxU1f8/8NewzLAOiOzKJiqKqCkuIS6YJCqa+xapmKmZRmhZUp9UbIFKzS2trNRSc99y1xQXMsMF9xBRwkzFdYZFkeX8/vDL/TksIgXMwH09H495PJh7z733feZemBf33jOjEEIIEBEREcmEkb4LICIiIqpKDD9EREQkKww/REREJCsMP0RERCQrDD9EREQkKww/REREJCsMP0RERCQrDD9EREQkKww/REREJCsMPyRLCQkJaNeuHSwtLaFQKJCYmKjvkgzC9OnToVAo9F2G3ikUCkyfPl3fZfwr4eHhsLKy0ncZzyw8PByenp76LoNkhuGH/pOlS5dCoVBAoVDg8OHDxeYLIeDm5gaFQoGePXtWSg3//PMPpk+f/swBJjc3FwMHDsTdu3fx5Zdf4qeffoKHh0el1GaIsrOzMX36dMTFxem7FCqivMcyEf07JvougGoGMzMzrFy5Eu3bt9eZfuDAAfz9999QqVSVtu1//vkH0dHR8PT0xHPPPVdm+5SUFPz1119YvHgxXnvttUqry1BlZ2cjOjoaABAUFKTfYgzUgwcPYGJS9X8ey3ss1wSLFy9GQUGBvssgmeGZH6oQPXr0wNq1a5GXl6czfeXKlfD394ezs7OeKisuPT0dAGBra1tm26ysrEquhgxFQUEBHj58COBxmNdH+JEjU1PTSv3nqKLl5eXh0aNH+i6D/iOGH6oQQ4cOxZ07d7Bnzx5p2qNHj7Bu3Tq8/PLLJS6TlZWFt99+G25ublCpVPDx8cHMmTMhhNBpt2fPHrRv3x62trawsrKCj48P3n//fQBAXFwcWrduDQAYOXKkdAlu6dKlJW4zPDwcnTp1AgAMHDgQCoVCOvtReK9ESkoKevToAWtra4SFhQF4/MY4Z84cNGnSBGZmZnBycsLYsWNx7949nfULIfDxxx+jbt26sLCwQOfOnXHu3Dl4enoiPDxcalfavTWFlxFTU1N1pu/YsQMdOnSApaUlrK2tERoainPnzhXrm5WVFa5du4Y+ffrAysoKDg4OeOedd5Cfnw8ASE1NhYODAwAgOjpaer1Ku7+lU6dOaN68eYnzfHx8EBISUuK8Jy1cuBBNmjSBSqWCq6srxo8fj/v37xdrd/ToUfTo0QO1atWCpaUlmjVrhrlz5+q0+fPPPzFo0CA4ODjA3NwcPj4++OCDD3Reg5LuHynp9VYoFJgwYQJWrFgh1bdz505p3pOvSeHyly5dQnh4OGxtbWFjY4ORI0ciOztbZ70PHjxAREQE7O3tYW1tjZdeegnXrl0r8z6iZzmW165dC39/f5ibm8Pe3h6vvPIKrl27Vuo6CyUmJsLBwQFBQUHIzMwEAFy7dg2vvvoqnJycoFKp0KRJE/zwww/FalIoFFizZg0++eQT1K1bF2ZmZujSpQsuXbqk0zY5ORn9+/eHs7MzzMzMULduXQwZMgQajeaptRXdZ6mpqVAoFJg5cya+/fZbeHt7Q6VSoXXr1khISCizrwBw//59TJw4EZ6enlCpVKhbty6GDx+O27dvS23S09MxatQoODk5wczMDM2bN8eyZct01vNkLXPmzJFqOX/+vPTarF69Gu+//z6cnZ1haWmJl156CVevXtVZT9Hf/0JBQUHFzr7Onz8fTZo0gYWFBWrVqoVWrVph5cqVz9Rvenb814YqhKenJwICAvDzzz+je/fuAB6/YWs0GgwZMgTz5s3TaS+EwEsvvYT9+/dj1KhReO6557Br1y5MnjwZ165dw5dffgkAOHfuHHr27IlmzZphxowZUKlUuHTpEuLj4wEAjRs3xowZMzB16l
SMGTMGHTp0AAC0a9euxDrHjh2LOnXq4NNPP0VERARat24NJycnaX5eXh5CQkLQvn17zJw5ExYWFtJyS5cuxciRIxEREYErV65gwYIFOHnyJOLj42FqagoAmDp1Kj7++GP06NEDPXr0wIkTJ9C1a9f/9J/iTz/9hBEjRiAkJASfffYZsrOzsWjRIrRv3x4nT57UeePIz89HSEgI2rZti5kzZ2Lv3r2YNWsWvL29MW7cODg4OGDRokUYN24c+vbti379+gEAmjVrVuK2hw0bhtGjR+Ps2bPw8/OTpickJODixYv43//+99Tap0+fjujoaAQHB2PcuHFISkrCokWLkJCQoPO67dmzBz179oSLiwveeustODs748KFC9i6dSveeustAMDp06fRoUMHmJqaYsyYMfD09ERKSgp++eUXfPLJJ//qtd23bx/WrFmDCRMmwN7evswbbwcNGgQvLy/ExMTgxIkT+O677+Do6IjPPvtMahMeHo41a9Zg2LBheP7553HgwAGEhoaWWUtZx3Lh8de6dWvExMTg5s2bmDt3LuLj43Hy5MlSz2QmJCQgJCQErVq1wubNm2Fubo6bN2/i+eeflwKgg4MDduzYgVGjRkGr1SIyMlJnHbGxsTAyMsI777wDjUaDzz//HGFhYTh69CiAx//ohISEICcnB2+++SacnZ1x7do1bN26Fffv34eNjU2Z/S9q5cqVyMjIwNixY6FQKPD555+jX79+uHz5snTclCQzMxMdOnTAhQsX8Oqrr6Jly5a4ffs2tmzZgr///hv29vZ48OABgoKCcOnSJUyYMAFeXl5Yu3YtwsPDcf/+femYK7RkyRI8fPgQY8aMgUqlgp2dnRTgP/nkEygUCrz33ntIT0/HnDlzEBwcjMTERJibm5erz4sXL0ZERAQGDBiAt956Cw8fPsTp06dx9OjRUv+JpH9JEP0HS5YsEQBEQkKCWLBggbC2thbZ2dlCCCEGDhwoOnfuLIQQwsPDQ4SGhkrLbdq0SQAQH3/8sc76BgwYIBQKhbh06ZIQQogvv/xSABC3bt0qtYaEhAQBQCxZsuSZat6/f78AINauXaszfcSIEQKAmDJlis70Q4cOCQBixYoVOtN37typMz09PV0olUoRGhoqCgoKpHbvv/++ACBGjBghTZs2bZoo6dev8PW8cuWKEEKIjIwMYWtrK0aPHq3T7saNG8LGxkZnemH9M2bM0GnbokUL4e/vLz2/deuWACCmTZtWbPtF67p//74wMzMT7733nk67iIgIYWlpKTIzM4uto1Dh69G1a1eRn58vTV+wYIEAIH744QchhBB5eXnCy8tLeHh4iHv37ums48nXsWPHjsLa2lr89ddfpbYZMWKE8PDwKLNfQggBQBgZGYlz584Va1/09Slc/tVXX9Vp17dvX1G7dm3p+fHjxwUAERkZqdMuPDy81Nf8SaUdy48ePRKOjo7Cz89PPHjwQJq+detWAUBMnTpVmjZixAhhaWkphBDi8OHDQq1Wi9DQUPHw4UOpzahRo4SLi4u4ffu2znaGDBkibGxspN/hwt+Vxo0bi5ycHKnd3LlzBQBx5swZIYQQJ0+eLPF36lkU3WdXrlwRAETt2rXF3bt3pembN28WAMQvv/zy1PVNnTpVABAbNmwoNq/wWJkzZ44AIJYvXy7Ne/TokQgICBBWVlZCq9Xq1KJWq0V6errOugpfmzp16kjthRBizZo1AoCYO3euNM3Dw0Pn979Qp06dRKdOnaTnvXv3Fk2aNHlq/6hi8LIXVZhBgwbhwYMH2Lp1KzIyMrB169ZS/1vZvn07jI2NERERoTP97bffhhACO3bsAPD/78vZvHlzld0UOW7cOJ3na9euhY2NDV588UXcvn1bevj7+8PKygr79+8HAOzduxePHj3Cm2++qXOJpeh/0eWxZ88e3L9/H0OHDtXZtrGxMdq2bStt+0mvv/66zvMOHTrg8uXL/2r7NjY26N27N37++WfpcmR+fj5Wr1
6NPn36wNLSstRlC1+PyMhIGBn9/z81o0ePhlqtxrZt2wAAJ0+exJUrVxAZGVns7EXh63jr1i0cPHgQr776Ktzd3Uts82906tQJvr6+z9y+pNf2zp070Gq1ACBdNnvjjTd02r355pv/ukYAOHbsGNLT0/HGG2/AzMxMmh4aGopGjRpJr+WT9u/fj5CQEHTp0gUbNmyQ7qsRQmD9+vXo1asXhBA6x1VISAg0Gg1OnDihs66RI0dCqVTq9BuAdFwVntnZtWtXscuA/9bgwYNRq1atUrdZmvXr16N58+bo27dvsXmFx8r27dvh7OyMoUOHSvNMTU0RERGBzMxMHDhwQGe5/v37S5eLixo+fDisra2l5wMGDICLiwu2b99eRg+Ls7W1xd9///3Ml/fo32P4oQrj4OCA4OBgrFy5Ehs2bEB+fj4GDBhQYtu//voLrq6uOn80gMen/gvnA4//AAYGBuK1116Dk5MThgwZgjVr1lRaEDIxMUHdunV1piUnJ0Oj0cDR0REODg46j8zMTOkG6sKaGzRooLO8g4ODzh/x8khOTgYAvPDCC8W2vXv3bmnbhczMzIr9ka5Vq1axe5PKY/jw4UhLS8OhQ4cAPA41N2/exLBhw566XOHr4ePjozNdqVSiXr160vyUlBQA0LmsVlThG97T2vwbXl5e5WpfNHgV7tfC1/evv/6CkZFRsfXWr1//P1RZ+msJAI0aNZLmF3r48CFCQ0PRokULrFmzRie43Lp1C/fv38e3335b7JgaOXIkABQ7rsrqt5eXFyZNmoTvvvsO9vb2CAkJwVdffVXm/T5PU9Y2S5OSklLmcfLXX3+hQYMGOqEcKP73p9DTjpOiv+8KhQL169cvdt/es3jvvfdgZWWFNm3aoEGDBhg/frx0iZ8qFu/5oQr18ssvY/To0bhx4wa6d+/+TCOqnsbc3BwHDx7E/v37sW3bNuzcuROrV6/GCy+8gN27d8PY2LhiCv8/KpWq2B/EgoICODo6YsWKFSUuU9p/hE9T2tmKwhuTn9w28Pi+n5JGzBUdkVTRrwcAhISEwMnJCcuXL0fHjh2xfPlyODs7Izg4uMK39V896+taqLz3ZJT2+ooiN+nrm0qlQo8ePbB582bs3LlT5zO2Co+pV155BSNGjChx+aL3gD1Lv2fNmoXw8HBs3rwZu3fvRkREBGJiYvD7778X+4fiWRjSa13e46Sopx2XT/azcePGSEpKwtatW7Fz506sX78eCxcuxNSpU6WPp6CKwTM/VKH69u0LIyMj/P7770+9Qc/DwwP//PMPMjIydKb/+eef0vxCRkZG6NKlC2bPno3z58/jk08+wb59+6RLPpX9icTe3t64c+cOAgMDERwcXOxROBqqsObCszWFbt26Vey/1cL/YouOeir6H6e3tzcAwNHRscRt/5vP6Snv62VsbIyXX34Z69atw71797Bp0yYMHTq0zKBV+HokJSXpTH/06BGuXLkizS/s49mzZ0tdV7169cpsAzx+XUsaSVb0da0sHh4eKCgowJUrV3SmFx0ZVZrS9k1pr2XhtKIf0qlQKLBixQp06dIFAwcO1PlASwcHB1hbWyM/P7/EYyo4OBiOjo7PVG9RTZs2xf/+9z8cPHgQhw4dwrVr1/D111//q3X9W97e3mUeJx4eHkhOTi52Brmkvz9lKfr7LoTApUuXdG6eL89xaWlpicGDB2PJkiVIS0tDaGgoPvnkE+ljGKhiMPxQhbKyssKiRYswffp09OrVq9R2PXr0QH5+PhYsWKAz/csvv4RCoZBGjN29e7fYsoUf/paTkwMA0n0nJf1xqQiDBg1Cfn4+Pvroo2Lz8vLypO0GBwfD1NQU8+fP1/nvdM6cOcWWK3zDP3jwoDQtKyur2FDbkJAQqNVqfPrpp8jNzS22nlu3bpW7P4Uj2Mrzeg0bNgz37t3D2LFjkZmZiVdeeaXMZYKDg6FUKjFv3jyd1+P777+HRqORRkC1bNkSXl5emDNnTrGaCp
dzcHBAx44d8cMPPyAtLa3ENsDj11Wj0eD06dPStOvXr2Pjxo3P3Nf/onDo/8KFC3Wmz58//5mWL+1YbtWqFRwdHfH1119Lxz3weETlhQsXShxNplQqsWHDBrRu3Rq9evXCH3/8AeBxmO3fvz/Wr19fYkj4N8eUVqst9hlfTZs2hZGRkU69VaF///44depUifu88Fjp0aMHbty4gdWrV0vz8vLyMH/+fFhZWUkfh/EsfvzxR51/4tatW4fr169Lf8OAx8fl77//rjPqc+vWrcWGxN+5c0fnuVKphK+vL4QQJf7+07/Hy15U4Uo7lf6kXr16oXPnzvjggw+QmpqK5s2bY/fu3di8eTMiIyOlcDBjxgwcPHgQoaGh8PDwQHp6OhYuXIi6detKnybt7e0NW1tbfP3117C2toalpSXatm1b7vs5StOpUyeMHTsWMTExSExMRNeuXWFqaork5GSsXbsWc+fOxYABA6TP1ImJiUHPnj3Ro0cPnDx5Ejt27IC9vb3OOrt27Qp3d3eMGjUKkydPhrGxMX744Qc4ODjovLmr1WosWrQIw4YNQ8uWLTFkyBCpzbZt2xAYGFgsQJbF3Nwcvr6+WL16NRo2bAg7Ozv4+fk99T6JFi1awM/PD2vXrkXjxo3RsmXLMrfj4OCAqKgoREdHo1u3bnjppZeQlJSEhQsXonXr1lKAMjIywqJFi9CrVy8899xzGDlyJFxcXPDnn3/i3Llz2LVrFwBg3rx5aN++PVq2bIkxY8bAy8sLqamp2LZtm/R1EEOGDMF7772Hvn37IiIiQvpYgIYNGxa7ibcy+Pv7o3///pgzZw7u3LkjDXW/ePEigLLPuj3tWP7ss88wcuRIdOrUCUOHDpWGunt6emLixIklrs/c3Bxbt27FCy+8gO7du+PAgQPw8/NDbGws9u/fj7Zt22L06NHw9fXF3bt3ceLECezdu7fEfzqeZt++fZgwYQIGDhyIhg0bIi8vDz/99JMUtKrS5MmTsW7dOgwcOBCvvvoq/P39cffuXWzZsgVff/01mjdvjjFjxuCbb75BeHg4jh8/Dk9PT6xbtw7x8fGYM2dOsXsRn8bOzg7t27fHyJEjcfPmTcyZMwf169fH6NGjpTavvfYa1q1bh27dumHQoEFISUnB8uXLpb9zhbp27QpnZ2cEBgbCyckJFy5cwIIFCxAaGlqumugZ6GOIGdUcTw51f5qiQ92FeDyMe+LEicLV1VWYmpqKBg0aiC+++EJn6PKvv/4qevfuLVxdXYVSqRSurq5i6NCh4uLFizrr2rx5s/D19RUmJiZlDnt/2lD3wiHCJfn222+Fv7+/MDc3F9bW1qJp06bi3XffFf/884/UJj8/X0RHRwsXFxdhbm4ugoKCxNmzZ0sc6nr8+HHRtm1boVQqhbu7u5g9e3axoe5P1hwSEiJsbGyEmZmZ8Pb2FuHh4eLYsWNl1l/SMO/ffvtN+Pv7C6VSqTMEu7Qh+EII8fnnnwsA4tNPPy31NSrJggULRKNGjYSpqalwcnIS48aNKzakXYjHw7JffPFFYW1tLSwtLUWzZs3E/PnzddqcPXtW9O3bV9ja2gozMzPh4+MjPvzwQ502u3fvFn5+fkKpVAofHx+xfPnyUoe6jx8/vsSagZKHuhf9yIWS9ldWVpYYP368sLOzE1ZWVqJPnz4iKSlJABCxsbFlvl5PO5ZXr14tWrRoIVQqlbCzsxNhYWHi77//1lm+pOPg9u3bwtfXVzg7O4vk5GQhhBA3b94U48ePF25ubsLU1FQ4OzuLLl26iG+//VZarrTflcIh4IW1Xb58Wbz66qvC29tbmJmZCTs7O9G5c2exd+/eMvtb2lD3L774oljbovulNHfu3BETJkwQderUEUqlUtStW1eMGDFCZ2j/zZs3xciRI4W9vb1QKpWiadOmxf5uPK2Wwtfm559/FlFRUcLR0VGYm5uL0NDQYh/HIIQQs2bNEnXq1BEqlUoEBgaKY8eOFRvq/s0334
iOHTuK2rVrC5VKJby9vcXkyZOFRqMps89UPgohDOxOPaIayNPTE0FBQaV+8nR1MHfuXEycOBGpqanFRuLQ0yUmJqJFixZYvny59KnhVL3FxcWhc+fOWLt2bamjWslw8Z4fIiqTEALff/89OnXqxOBThgcPHhSbNmfOHBgZGaFjx456qIiIiuI9P0RUqqysLGzZsgX79+/HmTNnsHnzZn2XZPA+//xzHD9+HJ07d4aJiQl27NiBHTt2YMyYMXBzc9N3eUQEhh8ieopbt27h5Zdfhq2tLd5//3289NJL+i7J4LVr1w579uzBRx99hMzMTLi7u2P69Ok6X8BKRPrFe36IiIhIVnjPDxEREckKww8RERHJCu/5KaKgoAD//PMPrK2tK/1rE4iIiKhiCCGQkZEBV1fXYt/RWBTDTxH//PMPR2QQERFVU1evXi3zy3QZfooo/Ajxq1evQq1W67kaIiIiehZarRZubm7P9FUgDD9FFF7qUqvVDD9ERETVzLPcssIbnomIiEhWGH6IiIhIVhh+iIiISFYYfoiIiEhWGH6IiIhIVhh+iIiISFY41L0UftN2wUhloe8yiIiIapTU2FB9l8AzP0RERCQvDD9EREQkKww/REREJCsMP0RERCQrNTb8xMXFQaFQ4P79+/ouhYiIiAyIwYWfR48e6bsEIiIiqsEqPfxkZGQgLCwMlpaWcHFxwZdffomgoCBERkYCADw9PfHRRx9h+PDhUKvVGDNmDADg8OHD6NChA8zNzeHm5oaIiAhkZWVJ6/3pp5/QqlUrWFtbw9nZGS+//DLS09MBAKmpqejcuTMAoFatWlAoFAgPD6/srhIREVE1UOnhZ9KkSYiPj8eWLVuwZ88eHDp0CCdOnNBpM3PmTDRv3hwnT57Ehx9+iJSUFHTr1g39+/fH6dOnsXr1ahw+fBgTJkyQlsnNzcVHH32EU6dOYdOmTUhNTZUCjpubG9avXw8ASEpKwvXr1zF37twS68vJyYFWq9V5EBERUc2lEEKIylp5RkYGateujZUrV2LAgAEAAI1GA1dXV4wePRpz5syBp6cnWrRogY0bN0rLvfbaazA2NsY333wjTTt8+DA6deqErKwsmJmZFdvWsWPH0Lp1a2RkZMDKygpxcXHo3Lkz7t27B1tb21JrnD59OqKjo4tNd4tcww85JCIiqmCV9SGHWq0WNjY20Gg0UKvVT21bqWd+Ll++jNzcXLRp00aaZmNjAx8fH512rVq10nl+6tQpLF26FFZWVtIjJCQEBQUFuHLlCgDg+PHj6NWrF9zd3WFtbY1OnToBANLS0spVY1RUFDQajfS4evXqv+kqERERVRMG8fUWlpaWOs8zMzMxduxYREREFGvr7u6OrKwshISEICQkBCtWrICDgwPS0tIQEhJS7humVSoVVCrVf6qfiIiIqo9KDT/16tWDqakpEhIS4O7uDuDxZa+LFy+iY8eOpS7XsmVLnD9/HvXr1y9x/pkzZ3Dnzh3ExsbCzc0NwOPLXk9SKpUAgPz8/IroChEREdUQlXrZy9raGiNGjMDkyZOxf/9+nDt3DqNGjYKRkREUCkWpy7333nv47bffMGHCBCQmJiI5ORmbN2+Wbnh2d3eHUqnE/PnzcfnyZWzZsgUfffSRzjo8PDygUCiwdetW3Lp1C5mZmZXZVSIiIqomKn201+zZsxEQEICePXsiODgYgYGBaNy4cYk3LRdq1qwZDhw4gIsXL6JDhw5o0aIFpk6dCldXVwCAg4MDli5dirVr18LX1xexsbGYOXOmzjrq1KmD6OhoTJkyBU5OTjojxYiIiEi+KnW0V0mysrJQp04dzJo1C6NGjarKTT+TwrvFOdqLiIio4hnCaK9Kv+H55MmT+PPPP9GmTRtoNBrMmDEDANC7d+/K3jQRERFRMVUy2mvmzJlISkqCUqmEv78/Dh06BHt7+6rYNBEREZGOKr/sZejKc9qMiIiIDIPBfMghERERkaFh+CEiIiJZYfghIi
IiWWH4ISIiIllh+CEiIiJZYfghIiIiWWH4ISIiIllh+CEiIiJZYfghIiIiWWH4ISIiIllh+CEiIiJZYfghIiIiWWH4ISIiIllh+CEiIiJZYfghIiIiWWH4ISIiIllh+CEiIiJZYfghIiIiWTHRdwGGym/aLhipLPRdBhERkcFKjQ3Vdwn/Cs/8EBERkaww/BAREZGsMPwQERGRrMgi/Hh6emLOnDn6LoOIiIgMgCzCDxEREVEhhh8iIiKSlSoNPxkZGQgLC4OlpSVcXFzw5ZdfIigoCJGRkQCAe/fuYfjw4ahVqxYsLCzQvXt3JCcn66xj/fr1aNKkCVQqFTw9PTFr1iyd+enp6ejVqxfMzc3h5eWFFStWVFX3iIiIqBqo0vAzadIkxMfHY8uWLdizZw8OHTqEEydOSPPDw8Nx7NgxbNmyBUeOHIEQAj169EBubi4A4Pjx4xg0aBCGDBmCM2fOYPr06fjwww+xdOlSnXVcvXoV+/fvx7p167Bw4UKkp6eXWlNOTg60Wq3Og4iIiGquKvuQw4yMDCxbtgwrV65Ely5dAABLliyBq6srACA5ORlbtmxBfHw82rVrBwBYsWIF3NzcsGnTJgwcOBCzZ89Gly5d8OGHHwIAGjZsiPPnz+OLL75AeHg4Ll68iB07duCPP/5A69atAQDff/89GjduXGpdMTExiI6OrsyuExERkQGpsjM/ly9fRm5uLtq0aSNNs7GxgY+PDwDgwoULMDExQdu2baX5tWvXho+PDy5cuCC1CQwM1FlvYGAgkpOTkZ+fL63D399fmt+oUSPY2tqWWldUVBQ0Go30uHr1akV0l4iIiAyU7L/eQqVSQaVS6bsMIiIiqiJVduanXr16MDU1RUJCgjRNo9Hg4sWLAIDGjRsjLy8PR48elebfuXMHSUlJ8PX1ldrEx8frrDc+Ph4NGzaEsbExGjVqhLy8PBw/flyan5SUhPv371diz4iIiKg6qbIzP9bW1hgxYgQmT54MOzs7ODo6Ytq0aTAyMoJCoUCDBg3Qu3dvjB49Gt988w2sra0xZcoU1KlTB7179wYAvP3222jdujU++ugjDB48GEeOHMGCBQuwcOFCAICPjw+6deuGsWPHYtGiRTAxMUFkZCTMzc2rqptERERk4Kp0tNfs2bMREBCAnj17Ijg4GIGBgWjcuDHMzMwAPL4B2t/fHz179kRAQACEENi+fTtMTU0BAC1btsSaNWuwatUq+Pn5YerUqZgxYwbCw8OlbRTeRN2pUyf069cPY8aMgaOjY1V2k4iIiAyYQggh9LXxrKws1KlTB7NmzcKoUaP0VYYOrVYLGxsbuEWugZHKQt/lEBERGazU2FB9lyApfP/WaDRQq9VPbVulNzyfPHkSf/75J9q0aQONRoMZM2YAgHRZi4iIiKiyVflor5kzZyIpKQlKpRL+/v44dOgQ7O3tq7oMIiIikim9XvYyROU5bUZERESGoTzv3/xiUyIiIpIVhh8iIiKSFYYfIiIikhWGHyIiIpIVhh8iIiKSFYYfIiIikhWGHyIiIpIVhh8iIiKSFYYfIiIikhWGHyIiIpIVhh8iIiKSFYYfIiIikhWGHyIiIpIVhh8iIiKSFYYfIiIikhWGHyIiIpIVhh8iIiKSFYYfIiIikhUTfRdgqPym7YKRykLfZRARUSVJjQ3VdwmkJzzzQ0RERLLC8ENERESywvBDREREsqLX8OPp6Yk5c+boswQiIiKSGZ75ISIiIllh+CEiIiJZqdTwExQUhAkTJmDChAmwsbGBvb09PvzwQwghSmw/e/ZsNG3aFJaWlnBzc8Mbb7yBzMxMnTbx8fEICgqChYUFatWqhZCQENy7dw8AUFBQgJiYGHh5ecHc3BzNmzfHunXrKrOLREREVM1U+pmfZcuWwcTEBH/88Qfmzp2L2bNn47vvviu5GCMjzJs3D+fOncOyZcuwb98+vPvuu9L8xMREdOnSBb6+vjhy5AgOHz6MXr16IT8/HwAQEx
ODH3/8EV9//TXOnTuHiRMn4pVXXsGBAwdKrS8nJwdarVbnQURERDWXQpR2GqYCBAUFIT09HefOnYNCoQAATJkyBVu2bMH58+fh6emJyMhIREZGlrj8unXr8Prrr+P27dsAgJdffhlpaWk4fPhwsbY5OTmws7PD3r17ERAQIE1/7bXXkJ2djZUrV5a4jenTpyM6OrrYdLfINfyQQyKiGowfclizaLVa2NjYQKPRQK1WP7VtpZ/5ef7556XgAwABAQFITk6WztY8ae/evejSpQvq1KkDa2trDBs2DHfu3EF2djaA/3/mpySXLl1CdnY2XnzxRVhZWUmPH3/8ESkpKaXWFxUVBY1GIz2uXr36H3tMREREhsxgvt4iNTUVPXv2xLhx4/DJJ5/Azs4Ohw8fxqhRo/Do0SNYWFjA3Ny81OUL7w3atm0b6tSpozNPpVKVupxKpXrqfCIiIqpZKj38HD16VOf577//jgYNGsDY2Fhn+vHjx1FQUIBZs2bByOjxCak1a9botGnWrBl+/fXXEi9T+fr6QqVSIS0tDZ06dargXhAREVFNUenhJy0tDZMmTcLYsWNx4sQJzJ8/H7NmzSrWrn79+sjNzcX8+fPRq1cvxMfH4+uvv9ZpExUVhaZNm+KNN97A66+/DqVSif3792PgwIGwt7fHO++8g4kTJ6KgoADt27eHRqNBfHw81Go1RowYUdldJSIiomqg0u/5GT58OB48eIA2bdpg/PjxeOuttzBmzJhi7Zo3b47Zs2fjs88+g5+fH1asWIGYmBidNg0bNsTu3btx6tQptGnTBgEBAdi8eTNMTB5nuI8++ggffvghYmJi0LhxY3Tr1g3btm2Dl5dXZXeTiIiIqolKH+313HPPVauvsCi8W5yjvYiIajaO9qpZDGq0FxEREZEhYfghIiIiWanUy17VUXlOmxEREZFh4GUvIiIiolIw/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrJjouwBD5TdtF4xUFvoug4iIKlBqbKi+SyADwDM/REREJCsMP0RERCQrDD9EREQkKwYZfoKCghAZGQkA8PT0xJw5c6R5CoUCmzZt0ktdREREVP0Z/A3PCQkJsLS01HcZREREVEMYfPhxcHDQdwlERERUgxjkZa8nFb3sVdS0adPg4uKC06dPAwAOHz6MDh06wNzcHG5uboiIiEBWVlYVVUtERESGzuDDT2mEEHjzzTfx448/4tChQ2jWrBlSUlLQrVs39O/fH6dPn8bq1atx+PBhTJgwodT15OTkQKvV6jyIiIio5qqW4ScvLw+vvPIKfv31Vxw+fBj169cHAMTExCAsLAyRkZFo0KAB2rVrh3nz5uHHH3/Ew4cPS1xXTEwMbGxspIebm1tVdoWIiIiqmMHf81OSiRMnQqVS4ffff4e9vb00/dSpUzh9+jRWrFghTRNCoKCgAFeuXEHjxo2LrSsqKgqTJk2Snmu1WgYgIiKiGqxahp8XX3wRP//8M3bt2oWwsDBpemZmJsaOHYuIiIhiy7i7u5e4LpVKBZVKVWm1EhERkWGpluHnpZdeQq9evfDyyy/D2NgYQ4YMAQC0bNkS58+fly6DERERERVVLe/5AYC+ffvip59+wsiRI7Fu3ToAwHvvvYfffvsNEyZMQGJiIpKTk7F58+an3vBMRERE8lItz/wUGjBgAAoKCjBs2DAYGRmhX79+OHDgAD744AN06NABQgh4e3tj8ODB+i6ViIiIDIRCCCH0XYQh0Wq1j0d9Ra6BkcpC3+UQEVEFSo0N1XcJVEkK3781Gg3UavVT21bby15ERERE/wbDDxEREclKtb7npzKdjQ4p87QZERERVT8880NERESywvBDREREssLwQ0RERLLC8E
NERESywvBDREREssLwQ0RERLLC8ENERESywvBDREREssLwQ0RERLLC8ENERESywvBDREREssLwQ0RERLLC8ENERESywvBDREREssLwQ0RERLLC8ENERESywvBDREREssLwQ0RERLJiou8CDJXftF0wUlnouwwiomojNTZU3yUQPROe+SEiIiJZYfghIiIiWWH4ISIiIlnRW/gRQmDMmDGws7ODQqFAYmJipWwnKCgIkZGRlbJuIiIiqn70dsPzzp07sXTpUsTFxaFevXqwt7fXVylEREQkI3oLPykpKXBxcUG7du30VQIRERHJkF4ue4WHh+PNN99EWloaFAoFPD09kZOTg4iICDg6OsLMzAzt27dHQkKCznIHDhxAmzZtoFKp4OLigilTpiAvL0+an5WVheHDh8PKygouLi6YNWtWVXeNiIiIDJxews/cuXMxY8YM1K1bF9evX0dCQgLeffddrF+/HsuWLcOJEydQv359hISE4O7duwCAa9euoUePHmjdujVOnTqFRYsW4fvvv8fHH38srXfy5Mk4cOAANm/ejN27dyMuLg4nTpx4ai05OTnQarU6DyIiIqq59BJ+bGxsYG1tDWNjYzg7O8PCwgKLFi3CF198ge7du8PX1xeLFy+Gubk5vv/+ewDAwoUL4ebmhgULFqBRo0bo06cPoqOjMWvWLBQUFCAzMxPff/89Zs6ciS5duqBp06ZYtmyZzpmhksTExMDGxkZ6uLm5VcVLQERERHpiEEPdU1JSkJubi8DAQGmaqakp2rRpgwsXLgAALly4gICAACgUCqlNYGAgMjMz8ffffyMlJQWPHj1C27Ztpfl2dnbw8fF56rajoqKg0Wikx9WrVyu4d0RERGRIZP/1FiqVCiqVSt9lEBERURUxiDM/3t7eUCqViI+Pl6bl5uYiISEBvr6+AIDGjRvjyJEjEEJIbeLj42FtbY26devC29sbpqamOHr0qDT/3r17uHjxYtV1hIiIiAyeQYQfS0tLjBs3DpMnT8bOnTtx/vx5jB49GtnZ2Rg1ahQA4I033sDVq1fx5ptv4s8//8TmzZsxbdo0TJo0CUZGRrCyssKoUaMwefJk7Nu3D2fPnkV4eDiMjAyii0RERGQgDOayV2xsLAoKCjBs2DBkZGSgVatW2LVrF2rVqgUAqFOnDrZv347JkyejefPmsLOzw6hRo/C///1PWscXX3yBzMxM9OrVC9bW1nj77beh0Wj01SUiIiIyQArx5HUkglarfTzqK3INjFQW+i6HiKjaSI0N1XcJJGOF798ajQZqtfqpbXlNiIiIiGSF4YeIiIhkxWDu+TE0Z6NDyjxtRkRERNUPz/wQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsmOi7AEPlN20XjFQW+i6DiEhvUmND9V0CUaXgmR8iIiKSFYYfIiIikhWGHyIiIpIVgws/4eHh6NOnj77LICIiohrK4MLP3LlzsXTp0gpZl6enJ+bMmVMh6yIiIqKaweBGe9nY2Oi7BCIiIqrBDO7Mz5OXvUo6c/Pcc89h+vTpAAAhBKZPnw53d3eoVCq4uroiIiICABAUFIS//voLEydOhEKhgEKhqMJeEBERkaEyuDM/5bF+/Xp8+eWXWLVqFZo0aYIbN27g1KlTAIANGzagefPmGDNmDEaPHl3qOnJycpCTkyM912q1lV43ERER6U+1Dj9paWlwdnZGcHAwTE1N4e7ujjZt2gAA7OzsYGxsDGtrazg7O5e6jpiYGERHR1dVyURERKRnBnfZqzwGDhyIBw8eoF69ehg9ejQ2btyIvLy8cq0jKioKGo1Gely9erWSqiUiIiJDYNDhx8jICEIInWm5ubnSz25ubkhKSsLChQ
thbm6ON954Ax07dtRpUxaVSgW1Wq3zICIioprLoMOPg4MDrl+/Lj3XarW4cuWKThtzc3P06tUL8+bNQ1xcHI4cOYIzZ84AAJRKJfLz86u0ZiIiIjJsBn3PzwsvvIClS5eiV69esLW1xdSpU2FsbCzNX7p0KfLz89G2bVtYWFhg+fLlMDc3h4eHB4DHo8UOHjyIIUOGQKVSwd7eXl9dISIiIgNh0Gd+oqKi0KlTJ/Ts2ROhoaHo06cPvL29pfm2trZYvHgxAgMD0axZM+zduxe//PILateuDQCYMWMGUlNT4e3tDQcHB311g4iIiAyIQhS9qUbPhg4dCmNjYyxfvlwv29dqtbCxsYFb5BoYqSz0UgMRkSFIjQ3VdwlEz6zw/Vuj0ZR5/67BnPnJy8vD+fPnceTIETRp0kTf5RAREVENZTDh5+zZs2jVqhWaNGmC119/Xd/lEBERUQ1lcJe99K08p82IiIjIMFTLy15EREREVYHhh4iIiGSF4YeIiIhkheGHiIiIZIXhh4iIiGSF4YeIiIhkheGHiIiIZIXhh4iIiGSF4YeIiIhkheGHiIiIZIXhh4iIiGSF4YeIiIhkheGHiIiIZIXhh4iIiGSF4YeIiIhkheGHiIiIZIXhh4iIiGSF4YeIiIhkxUTfBRgqv2m7YKSy0HcZRGQAUmND9V0CEVUgnvkhIiIiWWH4ISIiIllh+CEiIiJZ0Xv4CQoKQmRkpL7LICIiIpnQe/ghIiIiqkoMP0RERCQrBhV+7t27h+HDh6NWrVqwsLBA9+7dkZycDADQarUwNzfHjh07dJbZuHEjrK2tkZ2dDQC4evUqBg0aBFtbW9jZ2aF3795ITU2t6q4QERGRgTKo8BMeHo5jx45hy5YtOHLkCIQQ6NGjB3Jzc6FWq9GzZ0+sXLlSZ5kVK1agT58+sLCwQG5uLkJCQmBtbY1Dhw4hPj4eVlZW6NatGx49elTiNnNycqDVanUeREREVHMZTPhJTk7Gli1b8N1336FDhw5o3rw5VqxYgWvXrmHTpk0AgLCwMGzatEk6y6PVarFt2zaEhYUBAFavXo2CggJ89913aNq0KRo3bowlS5YgLS0NcXFxJW43JiYGNjY20sPNza0quktERER6YjDh58KFCzAxMUHbtm2labVr14aPjw8uXLgAAOjRowdMTU2xZcsWAMD69euhVqsRHBwMADh16hQuXboEa2trWFlZwcrKCnZ2dnj48CFSUlJK3G5UVBQ0Go30uHr1aiX3lIiIiPSpWn29hVKpxIABA7By5UoMGTIEK1euxODBg2Fi8rgbmZmZ8Pf3x4oVK4ot6+DgUOI6VSoVVCpVpdZNREREhsNgwk/jxo2Rl5eHo0ePol27dgCAO3fuICkpCb6+vlK7sLAwvPjiizh37hz27duHjz/+WJrXsmVLrF69Go6OjlCr1VXeByIiIjJ8BnPZq0GDBujduzdGjx6Nw4cP49SpU3jllVdQp04d9O7dW2rXsWNHODs7IywsDF5eXjqXycLCwmBvb4/evXvj0KFDuHLlCuLi4hAREYG///5bH90iIiIiA2Mw4QcAlixZAn9/f/Ts2RMBAQEQQmD79u0wNTWV2igUCgwdOhSnTp2SbnQuZGFhgYMHD8Ld3R39+vVD48aNMWrUKDx8+JBngoiIiAgAoBBCCH0XYUi0Wu3jUV+Ra2CkstB3OURkAFJjQ/VdAhGVofD9W6PRlHnCw6DO/BARERFVNoYfIiIikhWDGe1laM5Gh/A+ISIiohqIZ36IiIhIVhh+iIiISFYYfoiIiEhWGH6IiIhIVhh+iIiISFYYfoiIiEhWGH6IiIhIVhh+iIiISFYYfoiIiEhWGH6IiIhIVhh+iIiISFYYfoiIiEhWGH6IiIhIVhh+iIiISFYYfoiIiEhWGH6IiIhIVhh+iIiISFYYfoiIiEhWTPRdgKHym7YLRioLfZdBRBUgNTZU3yUQkQHhmR8iIiKSFYYfIiIikhWGHyIiIpIVgw
k/cXFxUCgUuH//vr5LISIiohpMb+EnKCgIkZGR0vN27drh+vXrsLGx0VdJREREJAMGM9pLqVTC2dlZ32UQERFRDaeXMz/h4eE4cOAA5s6dC4VCAYVCgaVLl+pc9lq6dClsbW2xdetW+Pj4wMLCAgMGDEB2djaWLVsGT09P1KpVCxEREcjPz5fWnZOTg3feeQd16tSBpaUl2rZti7i4OH10k4iIiAyQXs78zJ07FxcvXoSfnx9mzJgBADh37lyxdtnZ2Zg3bx5WrVqFjIwM9OvXD3379oWtrS22b9+Oy5cvo3///ggMDMTgwYMBABMmTMD58+exatUquLq6YuPGjejWrRvOnDmDBg0aFNtGTk4OcnJypOdarbaSek1ERESGQC/hx8bGBkqlEhYWFtKlrj///LNYu9zcXCxatAje3t4AgAEDBuCnn37CzZs3YWVlBV9fX3Tu3Bn79+/H4MGDkZaWhiVLliAtLQ2urq4AgHfeeQc7d+7EkiVL8OmnnxbbRkxMDKKjoyuxt0RERGRIDOaen5JYWFhIwQcAnJyc4OnpCSsrK51p6enpAIAzZ84gPz8fDRs21FlPTk4OateuXeI2oqKiMGnSJOm5VquFm5tbRXaDiIiIDIhBhx9TU1Od5wqFosRpBQUFAIDMzEwYGxvj+PHjMDY21mn3ZGB6kkqlgkqlqsCqiYiIyJDpLfwolUqdG5UrQosWLZCfn4/09HR06NChQtdNRERENYPePufH09MTR48eRWpqKm7fvi2dvfkvGjZsiLCwMAwfPhwbNmzAlStX8McffyAmJgbbtm2rgKqJiIioutNb+HnnnXdgbGwMX19fODg4IC0trULWu2TJEgwfPhxvv/02fHx80KdPHyQkJMDd3b1C1k9ERETVm0IIIfRdhCHRarWwsbGBW+QaGKks9F0OEVWA1NhQfZdARJWs8P1bo9FArVY/ta3BfLcXERERUVVg+CEiIiJZMeih7vp0NjqkzNNmREREVP3wzA8RERHJCsMPERERyQrDDxEREckKww8RERHJCsMPERERyQrDDxEREckKww8RERHJCsMPERERyQrDDxEREckKww8RERHJCsMPERERyQrDDxEREckKww8RERHJCsMPERERyQrDDxEREckKww8RERHJCsMPERERyQrDDxEREcmKib4LMFR+03bBSGWh7zKIqIjU2FB9l0BE1RzP/BAREZGsMPwQERGRrDD8EBERkaxUq/CTmpoKhUKBxMREAEBcXBwUCgXu37+v17qIiIio+qhW4YeIiIjov6qy8PPo0aOq2hQRERFRqSot/AQFBWHChAmIjIyEvb09QkJCcPbsWXTv3h1WVlZwcnLCsGHDcPv2bWmZnTt3on379rC1tUXt2rXRs2dPpKSkPNP2srKyoFarsW7dOp3pmzZtgqWlJTIyMiq0f0RERFQ9VeqZn2XLlkGpVCI+Ph6xsbF44YUX0KJFCxw7dgw7d+7EzZs3MWjQIKl9VlYWJk2ahGPHjuHXX3+FkZER+vbti4KCgjK3ZWlpiSFDhmDJkiU605csWYIBAwbA2tq6xOVycnKg1Wp1HkRERFRzVeqHHDZo0ACff/45AODjjz9GixYt8Omnn0rzf/jhB7i5ueHixYto2LAh+vfvr7P8Dz/8AAcHB5w/fx5+fn5lbu+1115Du3btcP36dbi4uCA9PR3bt2/H3r17S10mJiYG0dHR/7KHREREVN1U6pkff39/6edTp05h//79sLKykh6NGjUCAOnSVnJyMoYOHYp69epBrVbD09MTAJCWlvZM22vTpg2aNGmCZcuWAQCWL18ODw8PdOzYsdRloqKioNFopMfVq1f/TVeJiIiomqjUMz+WlpbSz5mZmejVqxc+++yzYu1cXFwAAL169YKHhwcWL14MV1dXFBQUwM/Pr1w3S7/22mv46quvMGXKFCxZsgQjR46EQqEotb1KpYJKpSpHr4iIiKg6q7Lv9mrZsiXWr18PT09PmJgU3+
ydO3eQlJSExYsXo0OHDgCAw4cPl3s7r7zyCt59913MmzcP58+fx4gRI/5z7URERFRzVNlQ9/Hjx+Pu3bsYOnQoEhISkJKSgl27dmHkyJHIz89HrVq1ULt2bXz77be4dOkS9u3bh0mTJpV7O7Vq1UK/fv0wefJkdO3aFXXr1q2E3hAREVF1VWXhx9XVFfHx8cjPz0fXrl3RtGlTREZGwtbWFkZGRjAyMsKqVatw/Phx+Pn5YeLEifjiiy/+1bZGjRqFR48e4dVXX63gXhAREVF1pxBCCH0XUdF++uknTJw4Ef/88w+USmW5ltVqtbCxsYFb5BoYqSwqqUIi+rdSY0P1XQIRGaDC92+NRgO1Wv3UtlV2z09VyM7OxvXr1xEbG4uxY8eWO/gQERFRzVejvtvr888/R6NGjeDs7IyoqCh9l0NEREQGqEZe9vovynPajIiIiAxDed6/a9SZHyIiIqKyMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkaww/BAREZGsMPwQERGRrDD8EBERkayY6LsAQ+U3bReMVBb6LoOoSqTGhuq7BCKiKsMzP0RERCQrDD9EREQkKww/REREJCsGH35SU1OhUCiQmJio71KIiIioBjD4G57d3Nxw/fp12Nvb67sUIiIiqgH0euYnNze3zDbGxsZwdnaGiYnB5zQiIiKqBsodftatW4emTZvC3NwctWvXRnBwMLKysgAA3333HRo3bgwzMzM0atQICxculJYrvHy1evVqdOrUCWZmZli0aBHMzc2xY8cOnW1s3LgR1tbWyM7OLvGy17lz59CzZ0+o1WpYW1ujQ4cOSElJkeY/rQ4iIiKSt3KdTrl+/TqGDh2Kzz//HH379kVGRgYOHToEIQRWrFiBqVOnYsGCBWjRogVOnjyJ0aNHw9LSEiNGjJDWMWXKFMyaNQstWrSAmZkZDh06hJUrV6J79+5SmxUrVqBPnz6wsCj+OTvXrl1Dx44dERQUhH379kGtViM+Ph55eXnSss9SR6GcnBzk5ORIz7VabXleEiIiIqpmyh1+8vLy0K9fP3h4eAAAmjZtCgCYNm0aZs2ahX79+gEAvLy8cP78eXzzzTc6oSMyMlJqAwBhYWEYNmwYsrOzYWFhAa1Wi23btmHjxo0l1vDVV1/BxsYGq1atgqmpKQCgYcOG0vxnraNQTEwMoqOjy/MyEBERUTVWrstezZs3R5cuXdC0aVMMHDgQixcvxr1795CVlYWUlBSMGjUKVlZW0uPjjz/WuRwFAK1atdJ53qNHD5iammLLli0AgPXr10OtViM4OLjEGhITE9GhQwcp+DypPHUUioqKgkajkR5Xr14tz0tCRERE1Uy5zvwYGxtjz549+O2337B7927Mnz8fH3zwAX755RcAwOLFi9G2bdtiyzzJ0tJS57lSqcSAAQOwcuVKDBkyBCtXrsTgwYNLvcHZ3Ny81PoyMzOfuY5CKpUKKpWq1HUSERFRzVLuIVQKhQKBgYEIDAzE1KlT4eHhgfj4eLi6uuLy5csICwsrdxFhYWF48cUXce7cOezbtw8ff/xxqW2bNWuGZcuWITc3t9jZHycnp/9UBxEREdV85Qo/R48exa+//oquXbvC0dERR48exa1bt9C4cWNER0cjIiICNjY26NatG3JycnDs2DHcu3cPkyZNeup6O3bsCGdnZ4SFhcHLy6vYWZsnTZgwAfPnz8eQIUMQFRUFGxsb/P7772jTpg18fHz+Ux1ERERU85Ur/KjVahw8eBBz5syBVquFh4cHZs2aJY3UsrCwwBdffIHJkyfD0tISTZs2RWRkZJnrVSgU0iiyqVOnPrVt7dq1sW/fPkyePBmdOnWCsbExnnvuOQQGBgIAXnvttX9dBxEREdV8CiGE0HcRhkSr1cLGxgZukW
tgpCo+1J6oJkqNDdV3CURE/0nh+7dGo4FarX5qW4P/bi8iIiKiisTwQ0RERLLCL8wqxdnokDJPmxEREVH1wzM/REREJCsMP0RERCQrDD9EREQkKww/REREJCsMP0RERCQrDD9EREQkKww/REREJCv8nJ8iCr/tQ6vV6rkSIiIielaF79vP8q1dDD9F3LlzBwDg5uam50qIiIiovDIyMmBjY/PUNgw/RdjZ2QEA0tLSynzxqiOtVgs3NzdcvXq1xn2CdU3uG8D+VWc1uW9Aze5fTe4bULP6J4RARkYGXF1dy2zL8FOEkdHj26BsbGyq/YHwNGq1usb2ryb3DWD/qrOa3DegZvevJvcNqDn9e9aTFrzhmYiIiGSF4YeIiIhkheGnCJVKhWnTpkGlUum7lEpRk/tXk/sGsH/VWU3uG1Cz+1eT+wbU/P6VRiGeZUwYERERUQ3BMz9EREQkKww/REREJCsMP0RERCQrDD9EREQkKww/REREJCsMP0V89dVX8PT0hJmZGdq2bYs//vhD3yWVafr06VAoFDqPRo0aSfMfPnyI8ePHo3bt2rCyskL//v1x8+ZNnXWkpaUhNDQUFhYWcHR0xOTJk5GXl1fVXcHBgwfRq1cvuLq6QqFQYNOmTTrzhRCYOnUqXFxcYG5ujuDgYCQnJ+u0uXv3LsLCwqBWq2Fra4tRo0YhMzNTp83p06fRoUMHmJmZwc3NDZ9//nlldw1A2f0LDw8vti+7deum08ZQ+xcTE4PWrVvD2toajo6O6NOnD5KSknTaVNSxGBcXh5YtW0KlUqF+/fpYunRpZXfvmfoXFBRUbP+9/vrrOm0MsX+LFi1Cs2bNpE/5DQgIwI4dO6T51Xm/AWX3r7rut5LExsZCoVAgMjJSmlbd91+lECRZtWqVUCqV4ocffhDnzp0To0ePFra2tuLmzZv6Lu2ppk2bJpo0aSKuX78uPW7duiXNf/3114Wbm5v49ddfxbFjx8Tzzz8v2rVrJ83Py8sTfn5+Ijg4WJw8eVJs375d2Nvbi6ioqCrvy/bt28UHH3wgNmzYIACIjRs36syPjY0VNjY2YtOmTeLUqVPipZdeEl5eXuLBgwdSm27duonmzZuL33//XRw6dEjUr19fDB06VJqv0WiEk5OTCAsLE2fPnhU///yzMDc3F998843e+zdixAjRrVs3nX159+5dnTaG2r+QkBCxZMkScfbsWZGYmCh69Ogh3N3dRWZmptSmIo7Fy5cvCwsLCzFp0iRx/vx5MX/+fGFsbCx27typ9/516tRJjB49Wmf/aTQag+/fli1bxLZt28TFixdFUlKSeP/994Wpqak4e/asEKJ677dn6V913W9F/fHHH8LT01M0a9ZMvPXWW9L06r7/KgPDzxPatGkjxo8fLz3Pz88Xrq6uIiYmRo9VlW3atGmiefPmJc67f/++MDU1FWvXrpWmXbhwQQAQR44cEUI8fkM2MjISN27ckNosWrRIqNVqkZOTU6m1P03RcFBQUCCcnZ3FF198IU27f/++UKlU4ueffxZCCHH+/HkBQCQkJEhtduzYIRQKhbh27ZoQQoiFCxeKWrVq6fTtvffeEz4+PpXcI12lhZ/evXuXukx16l96eroAIA4cOCCEqLhj8d133xVNmjTR2dbgwYNFSEhIZXdJR9H+CfH4TfTJN52iqlP/atWqJb777rsat98KFfZPiJqx3zIyMkSDBg3Enj17dPpTU/fff8XLXv/n0aNHOH78OIKDg6VpRkZGCA4OxpEjR/RY2bNJTk6Gq6sr6tWrh7CwMKSlpQEAjh8/jtzcXJ1+NWrUCO7u7lK/jhw5gqZNm8LJyUlqExISAq1Wi3PnzlVtR57iypUruHHjhk5fbGxs0LZtW52+2NraolWrVlKb4OBgGBkZ4ejRo1Kbjh07QqlUSm1CQkKQlJSEe/fuVVFvShcXFwdHR0f4+Phg3LhxuHPnjjSvOvVPo9EAAOzs7ABU3LF45MgRnXUUtqnq39
Oi/Su0YsUK2Nvbw8/PD1FRUcjOzpbmVYf+5efnY9WqVcjKykJAQECN229F+1eouu+38ePHIzQ0tFgNNW3/VRR+q/v/uX37NvLz83V2PgA4OTnhzz//1FNVz6Zt27ZYunQpfHx8cP36dURHR6NDhw44e/Ysbty4AaVSCVtbW51lnJyccOPGDQDAjRs3Sux34TxDUVhLSbU+2RdHR0ed+SYmJrCzs9Np4+XlVWwdhfNq1apVKfU/i27duqFfv37w8vJCSkoK3n//fXTv3h1HjhyBsbFxtelfQUEBIiMjERgYCD8/P2nbFXEsltZGq9XiwYMHMDc3r4wu6SipfwDw8ssvw8PDA66urjh9+jTee+89JCUlYcOGDU+tvXDe09pUdv/OnDmDgIAAPHz4EFZWVti4cSN8fX2RmJhYI/Zbaf0Dqvd+A4BVq1bhxIkTSEhIKDavJv3eVSSGnxqge/fu0s/NmjVD27Zt4eHhgTVr1lS7A1LuhgwZIv3ctGlTNGvWDN7e3oiLi0OXLl30WFn5jB8/HmfPnsXhw4f1XUqlKK1/Y8aMkX5u2rQpXFxc0KVLF6SkpMDb27uqyywXHx8fJCYmQqPRYN26dRgxYgQOHDig77IqTGn98/X1rdb77erVq3jrrbewZ88emJmZ6bucaoOXvf6Pvb09jI2Ni90Bf/PmTTg7O+upqn/H1tYWDRs2xKVLl+Ds7IxHjx7h/v37Om2e7Jezs3OJ/S6cZygKa3naPnJ2dkZ6errO/Ly8PNy9e7fa9RcA6tWrB3t7e1y6dAlA9ejfhAkTsHXrVuzfvx9169aVplfUsVhaG7VaXSVhv7T+laRt27YAoLP/DLV/SqUS9evXh7+/P2JiYtC8eXPMnTu3xuy30vpXkuq0344fP4709HS0bNkSJiYmMDExwYEDBzBv3jyYmJjAycmpRuy/isbw83+USiX8/f3x66+/StMKCgrw66+/6lwXrg4yMzORkpICFxcX+Pv7w9TUVKdfSUlJSEtLk/oVEBCAM2fO6Lyp7tmzB2q1WjotbAi8vLzg7Oys0xetVoujR4/q9OX+/fs4fvy41Gbfvn0oKCiQ/qAFBATg4MGDyM3Nldrs2bMHPj4+er3kVZK///4bd+7cgYuLCwDD7p8QAhMmTMDGjRuxb9++YpfeKupYDAgI0FlHYZvK/j0tq38lSUxMBACd/Weo/SuqoKAAOTk51X6/laawfyWpTvutS5cuOHPmDBITE6VHq1atEBYWJv1cE/fff6bvO64NyapVq4RKpRJLly4V58+fF2PGjBG2trY6d8AborffflvExcWJK1euiPj4eBEcHCzs7e1Fenq6EOLxMEd3d3exb98+cezYMREQECACAgKk5QuHOXbt2lUkJiaKnTt3CgcHB70Mdc/IyBAnT54UJ0+eFADE7NmzxcmTJ8Vff/0lhHg81N3W1lZs3rxZnD59WvTu3bvEoe4tWrQQR48eFYcPHxYNGjTQGQp+//594eTkJIYNGybOnj0rVq1aJSwsLKpkqPvT+peRkSHeeecdceTIEXHlyhWxd+9e0bJlS9GgQQPx8OFDg+/fuHHjhI2NjYiLi9MZMpydnS21qYhjsXDI7eTJk8WFCxfEV199VSVDbsvq36VLl8SMGTPEsWPHxJUrV8TmzZtFvXr1RMeOHQ2+f1OmTBEHDhwQV65cEadPnxZTpkwRCoVC7N69WwhRvfdbWf2rzvutNEVHr1X3/VcZGH6KmD9/vnB3dxdKpVK0adNG/P777/ouqUyDBw8WLi4uQqlUijp16ojBgweLS5cuSfMfPHgg3njjDVGrVi1hYWEh+vbtK65fv66zjtTUVNG9e3dhbm4u7O3txdtvvy1yc3Oruiti//79AkCxx4gRI4QQj4e7f/jhh8LJyUmoVCrRpUsXkZSUpLOOO3fuiKFDhworKyuhVqvFyJEjRUZGhk6bU6dOifbt2wuVSiXq1KkjYmNj9d6/7Oxs0bVrV+Hg4CBMTU2Fh4
eHGD16dLHwbaj9K6lfAMSSJUukNhV1LO7fv18899xzQqlUinr16ulsQ1/9S0tLEx07dhR2dnZCpVKJ+vXri8mTJ+t8Xoyh9u/VV18VHh4eQqlUCgcHB9GlSxcp+AhRvfebEE/vX3Xeb6UpGn6q+/6rDAohhKi680xERERE+sV7foiIiEhWGH6IiIhIVhh+iIiISFYYfoiIiEhWGH6IiIhIVhh+iIiISFYYfoiIiEhWGH6IiIhIVhh+iIiISFYYfoiIiEhWGH6IiIhIVv4fSIyur+VmAFwAAAAASUVORK5CYII=",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dtm_df.sum().nlargest(10).sort_values().plot(kind='barh')\n",
"plt.title('Most frequently occurring tokens in corpus')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "bf473d57-c205-41b9-ad53-cb3e9f3b19e3",
"metadata": {},
"source": [
"That's interesting, but perhaps these tokens are not distributed across the different review types evenly. Let's instead look at the most frequently occurring tokens per source. To do this, we will use a pandas [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html).\n",
"\n",
"First, we need to add the `source` column to the document-term matrix dataframe. Before doing so, we should check whether the dataframe already has a `source` column (the word *source* could itself have appeared as a token in the dataset):"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "fc07c845-90af-4fb0-a713-644b242757ed",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
" ..\n",
"14995 0\n",
"14996 0\n",
"14997 0\n",
"14998 0\n",
"14999 0\n",
"Name: source, Length: 15000, dtype: int64"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# CHECK if there is a source column already in the dataframe\n",
"# as we do not want to overwrite it\n",
"dtm_df['source']"
]
},
{
"cell_type": "markdown",
"id": "67f0d6b7-9a0c-4172-ac79-983824e7711e",
"metadata": {},
"source": [
"Indeed, there is! We should use a more distinctive name such as `review_source` - let's quickly check that it is not already among the columns either:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "e0233bad-f256-4c99-8cf1-c783847e3b5a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum(dtm_df.columns == 'review_source')"
]
},
{
"cell_type": "markdown",
"id": "45587ef0-09af-4489-aed2-b497677135fd",
"metadata": {},
"source": [
"It is not. Finally, we check that the row count is the same between the document-term matrix dataframe `dtm_df` and the original dataframe with the `source` column, `df`, to make sure we are okay to append the latter to the former:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "16cc3c2f-3c40-428a-9a0e-0634f9cc99b4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dtm_df.shape[0] == df.shape[0]"
]
},
{
"cell_type": "markdown",
"id": "aec98376-4c8f-4fa9-abe5-000fa76c80de",
"metadata": {},
"source": [
"Ok, we can append the column:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "1f0202cc-bfc2-4f16-9173-74961d85dc1e",
"metadata": {},
"outputs": [],
"source": [
"# Append the source column onto the document term dataframe\n",
"dtm_df['review_source'] = df['source']"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "32dd51f7-575d-4048-adf9-8b9ddc21930e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 yelp\n",
"1 yelp\n",
"2 yelp\n",
"3 yelp\n",
"4 yelp\n",
" ... \n",
"14995 amazon\n",
"14996 amazon\n",
"14997 amazon\n",
"14998 amazon\n",
"14999 amazon\n",
"Name: review_source, Length: 15000, dtype: object"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dtm_df['review_source']"
]
},
{
"cell_type": "markdown",
"id": "504a085e-1505-42c1-9beb-3e819350ae73",
"metadata": {},
"source": [
"Now we can group by review type and with a single line of code in pandas, calculate the total number of occurrences of each token by review type!"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "b2e39bea-95a9-4e66-b0c2-02356bb3b454",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"review_source amazon rottentomatoes yelp\n",
"0 great film place\n",
"1 tablet movie good\n",
"2 love like food\n",
"3 use story great\n",
"4 easy just like\n",
"5 bought good just\n",
"6 kindle characters time\n",
"7 amazon time service\n",
"8 echo comedy really\n",
"9 good films dont\n",
"10 like way love\n",
"11 alexa funny nice\n",
"12 loves little im\n",
"13 screen bad little\n",
"14 price make ive\n",
"15 just movies best\n",
"16 product makes got\n",
"17 kids life pretty\n",
"18 old director try\n",
"19 music best restaurant\n",
"20 works really ordered\n",
"21 apps love didnt\n",
"22 device doesnt people\n",
"23 books work chicken\n",
"24 games theres menu"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
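"# Total count of each token per review source, then the top 25 tokens for each source\n",
"grouped_sum = dtm_df.groupby('review_source').sum()\n",
"grouped_sum.apply(lambda x: pd.Series(x.nlargest(25).index), axis=1).T"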
]
},
{
"cell_type": "markdown",
"id": "4901e44a-299a-4f41-b468-3f9ed5ab4251",
"metadata": {},
"source": [
"We can see that the Amazon reviews are mainly about electronics, the RottenTomatoes reviews mostly contain movie words, and the Yelp reviews mostly contain food and restaurant words, as would be expected.\n",
"\n",
"Finally, we can return the token counts themselves from the lambda function (rather than only the token names), transpose the result, and plot the top 5 tokens for each source:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "946cb85b-7f6c-47ab-b766-c1a3dd6a95aa",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"grouped_sum.apply(lambda x: pd.Series(x.nlargest(5)), axis=1).T.plot(kind='barh')\n",
"plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "a5c7b4a1-f0b8-4f79-b9a2-076d9988e770",
"metadata": {},
"source": [
"We can see some words only occur frequently in a single category, whereas some others are frequently occurring in multiple review types (*e.g.* 'great', 'like').\n",
"\n",
"Now we will drop the `review_source` column from our document-term matrix dataframe, as it is no longer required once we move on to machine learning:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "2a9e8bdb-551a-4d27-b885-fa79e7f4957d",
"metadata": {},
"outputs": [],
"source": [
"# Drop the review_source column from dtm_df\n",
"dtm_df.drop('review_source', axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "d4e8476a-215e-4996-ba65-9321b6ee630f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check\n",
"'review_source' in dtm_df.columns"
]
},
{
"cell_type": "markdown",
"id": "8ecadfe8-4510-42ad-96f3-8ebf16f018c9",
"metadata": {},
"source": [
"### Preprocessing the target column (`source`)\n",
"\n",
"Now that we have preprocessed the reviews into numeric features for machine learning, we must also address the target column, `source`. Here we have a single column with the values 'yelp', 'amazon' and 'rottentomatoes'. We would like these to be integer values representing the categorical buckets (classes) for supervised learning with a classification model."
]
},
{
"cell_type": "markdown",
"id": "1b3e695d-c6c8-4eff-b085-0c85a52e9ab8",
"metadata": {},
"source": [
"One way to do this is using the `map` method from pandas and providing a dictionary to map the distinct values in the column to corresponding integer values:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "774a39a8-6398-4d1a-9268-965dd7684794",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
" ..\n",
"14995 2\n",
"14996 2\n",
"14997 2\n",
"14998 2\n",
"14999 2\n",
"Name: source, Length: 15000, dtype: int64"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Method 1 - using pd.Series.map\n",
"df['source'].map({'yelp':0, 'rottentomatoes':1, 'amazon':2})"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "2121691a-db8b-45f6-b86a-6789d455fb96",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"source\n",
"0 5000\n",
"1 5000\n",
"2 5000\n",
"Name: count, dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check\n",
"df['source'].map({'yelp':0, 'rottentomatoes':1, 'amazon':2}).value_counts()"
]
},
{
"cell_type": "markdown",
"id": "29552c0a-cfeb-438c-8781-9bbeb908190f",
"metadata": {},
"source": [
"This is fine, because we know the distinct values which appear in our target column, and the number of distinct values (*i.e.* the cardinality of the target) is low (3). What if we don't know all the different values and/or there are a very large number of categories (*i.e.* the target has very high cardinality)? In that case, writing out the dictionary for the `map` method by hand would be difficult or perhaps not possible.\n",
"\n",
"Instead, this would be a case where we would use the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) from scikit-learn.\n",
"\n",
"As with other 'transformer' type classes in sklearn, we instantiate and then call `fit_transform`:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "c6612db6-f20d-4a72-a39c-af9824267aaf",
"metadata": {},
"outputs": [],
"source": [
"# import\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"# instantiate\n",
"le = LabelEncoder()\n",
"\n",
"# fit-transform\n",
"y = pd.Series(le.fit_transform(df['source']))"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "0ebc719e-62ce-418e-8d5c-cf23d9fb26dc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2\n",
"1 2\n",
"2 2\n",
"3 2\n",
"4 2\n",
" ..\n",
"14995 0\n",
"14996 0\n",
"14997 0\n",
"14998 0\n",
"14999 0\n",
"Length: 15000, dtype: int32"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check\n",
"y"
]
},
{
"cell_type": "markdown",
"id": "7b938d20-0a9b-4992-9d72-21f2c9b95a1a",
"metadata": {},
"source": [
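"The distinct classes found in the target column are stored in the `.classes_` attribute of the LabelEncoder. They are sorted, and each class's position in the array is its integer label, so here 'amazon' is 0, 'rottentomatoes' is 1 and 'yelp' is 2 (a different assignment from the manual mapping above):"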
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "731dd1d0-d3a5-451d-98e2-e1e5c1a5a77c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['amazon', 'rottentomatoes', 'yelp'], dtype=object)"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"le.classes_"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "dad87c01-e4cd-4f23-85f8-762ff4d7c809",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2 5000\n",
"1 5000\n",
"0 5000\n",
"Name: count, dtype: int64"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check\n",
"y.value_counts()"
]
},
{
"cell_type": "markdown",
"id": "ed492e06-1672-4bba-ad43-3e91d69c3536",
"metadata": {},
"source": [
"## Machine Learning\n",
"\n",
"Now that we've completed all the preprocessing we can move forward into the machine learning piece of the case study to build our MVP model. We have already created our target feature, `y`, above, and the document-term matrix dataframe will serve as our input features, `X`:"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "7596c003-c814-4b33-b8d0-678741d2ce27",
"metadata": {},
"outputs": [],
"source": [
"# Training data\n",
"X = dtm_df\n",
"\n",
"# If on Colab Free or low resource machine, uncomment below\n",
"# Use original sparse document-term matrix for training\n",
"# X = dtm\n",
"\n",
"# y is already assigned"
]
},
{
"cell_type": "markdown",
"id": "1a1cb1b4-384b-4b71-a071-411588b96515",
"metadata": {},
"source": [
"Next, we split our data into training and test sets. Note that a single split is a naive approach, and more robust approaches such as [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) should be used in practice; however, this is fine for our MVP. Because we have a sizeable number of observations (15K), we can use a smaller test size - we choose 13.3% so that the train and test sets come out to round numbers (13,000 and 2,000 reviews):"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "93720653-f610-48cb-a830-c60be55cdd89",
"metadata": {},
"outputs": [],
"source": [
"# Train test split\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1333333)"
]
},
{
"cell_type": "markdown",
"id": "a483f671-43e6-47f8-a005-7c601a895d00",
"metadata": {},
"source": [
"Let's check the train and test set sizes:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "9d842aff-d48d-4e39-be11-d9956b88bdf6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(13000, 33715)"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check\n",
"X_train.shape"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "d66362d8-7fa8-485d-b0bb-de4c8211dc4d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2000, 33715)"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "b862e5a6-8cce-4adf-8f28-490e5a79cdc7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"438,295,000\n"
]
}
],
"source": [
"print(f'{X_train.shape[0]*X_train.shape[1]:,}')"
]
},
{
"cell_type": "markdown",
"id": "6d1827f1-4014-424f-bf49-a5aeaa89b57f",
"metadata": {},
"source": [
"We have 13K reviews in the training set, and 2K in test. There are a total of ~438M elements in the training data! How large is the training data array in memory?"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "77e39ea4-ae7a-49e6-a566-afd4b6aa4183",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.2656491100788116"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sys\n",
"\n",
"# Size of data in GB\n",
"sys.getsizeof(X_train)/(1024**3)"
]
},
{
"cell_type": "markdown",
"id": "6a0e208c-875c-4fa8-81f7-7c4c9ddf7d03",
"metadata": {},
"source": [
"The training data is ~3.3 GB! This may be challenging for lower-end machines or environments such as Google Colab free. Nonetheless, we press forward and fit our v0 model - the actual machine learning piece is very straightforward with only 3-4 lines of code, as we saw in [Section 3 of the course](https://nlpfor.me):"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "03f93872-ba33-49f5-ba66-e54b5047f3a9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression()
"
],
"text/plain": [
"LogisticRegression()"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# ML\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# 1. Instantiate\n",
"logreg = LogisticRegression()\n",
"\n",
"# 2. Fit\n",
"logreg.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"id": "baf10261-710d-4df9-abce-23ed81308ba4",
"metadata": {},
"source": [
"We have fit a very simple 'vanilla' Logistic Regression model. Let's check for overfitting by evaluating on the training and test sets:"
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "22adbc54-25f3-464b-9046-4b55df030edb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9966923076923077\n",
"0.972\n"
]
}
],
"source": [
"# 3. Evaluate\n",
"print(logreg.score(X_train, y_train))\n",
"print(logreg.score(X_test, y_test))"
]
},
{
"cell_type": "markdown",
"id": "1f0cbe3a-5242-45d4-bfdc-ea1b61fe1615",
"metadata": {},
"source": [
"While the training score is higher as to be expected, there is not a significant delta between train and test, so there is not extreme overfitting occurring which is good 👍"
]
},
{
"cell_type": "markdown",
"id": "2528919a-bffe-44b2-b0e3-1784305dfea4",
"metadata": {},
"source": [
"## Testing Our Model"
]
},
{
"cell_type": "markdown",
"id": "b47d9517-fbaf-4aff-8017-261e1bfdc5d6",
"metadata": {},
"source": [
"Now that we have fit our MVP model, we can test on some new data from reviews which appeared on the website to \"smoke test\" and see if it performs as expected. Here is a sample of three reviews, in the restaurant, movie, and retail categories:"
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "969a8a0f-5a1b-4309-93ee-db21c4895a66",
"metadata": {},
"outputs": [],
"source": [
"new_data = ['Absolutely loved this place! Would recommend!', \\\n",
" 'Complete trash... avoid this film at all costs, I hate this director', \\\n",
" 'Garbage product, screen did not power on, was a greasy film on the back immediately after I bought it. Will be returning it.']"
]
},
{
"cell_type": "markdown",
"id": "37b4804d-4e11-4460-aaf6-110b0d391f4e",
"metadata": {},
"source": [
"We need to put this data through the same preprocessing as our training data to fit it into the model - *i.e.* clean / normalize, tokenize, remove stopwords, vectorize, etc.\n",
"\n",
"Fortunately for us, we wrote a reusable function to do the former - so this is now just a simple function call passing in the data as a pandas Series:\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "83062eef-6567-4101-a36f-ac0c79eadd1c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 absolutely loved this place would recommend\n",
"1 complete trash avoid this film at all costs i ...\n",
"2 garbage product screen did not power on was a ...\n",
"dtype: object"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a pandas series\n",
"new_reviews = pd.Series(new_data)\n",
"\n",
"# Test review function\n",
"new_reviews = preprocess_text(new_reviews)\n",
"\n",
"new_reviews"
]
},
{
"cell_type": "markdown",
"id": "1b67c1a5-a47b-47fa-bd39-7cb185a5eff3",
"metadata": {},
"source": [
"Great. Now we need to count vectorize the processed data using the originally fit count vectorizer, as our model expects numeric input with the 33,715 tokens (features) as from the original training data:"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "63a8c805-5569-4d8a-97ca-ab5932df53be",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
00
\n",
"
007
\n",
"
01
\n",
"
01042012
\n",
"
03342
\n",
"
039
\n",
"
050
\n",
"
06
\n",
"
07092008
\n",
"
075
\n",
"
...
\n",
"
äúshow
\n",
"
äúskills
\n",
"
äústar
\n",
"
äúthings
\n",
"
école
\n",
"
ém
\n",
"
ótimo
\n",
"
ôºå
\n",
"
única
\n",
"
único
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
\n",
"
\n",
"
1
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
\n",
"
\n",
"
2
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
...
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
0
\n",
"
\n",
" \n",
"
\n",
"
3 rows × 33715 columns
\n",
"
"
],
"text/plain": [
" 00 007 01 01042012 03342 039 050 06 07092008 075 ... äúshow \\\n",
"0 0 0 0 0 0 0 0 0 0 0 ... 0 \n",
"1 0 0 0 0 0 0 0 0 0 0 ... 0 \n",
"2 0 0 0 0 0 0 0 0 0 0 ... 0 \n",
"\n",
" äúskills äústar äúthings école ém ótimo ôºå única único \n",
"0 0 0 0 0 0 0 0 0 0 \n",
"1 0 0 0 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 0 0 0 \n",
"\n",
"[3 rows x 33715 columns]"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count vectorize - CV has already been fit\n",
"new_dtm = pd.DataFrame(cv.transform(new_reviews).toarray(), columns=cv.get_feature_names_out())\n",
"\n",
"new_dtm"
]
},
{
"cell_type": "markdown",
"id": "5bb02819-eaf1-4a33-aec7-54b18f474436",
"metadata": {},
"source": [
"Great! Now we can make predictions. The model returns the class labels (0,1,2) and we can convert these back to the text representations using the `inverse_transform` method in the LabelEncoder:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "89460e39-919b-46e1-845e-c527523de448",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['yelp', 'rottentomatoes', 'amazon'], dtype=object)"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# So no we have the data in dtm form\n",
"# Get class labels and apply inverse transform to get ORIGINAL labels\n",
"le.inverse_transform(logreg.predict(new_dtm))"
]
},
{
"cell_type": "markdown",
"id": "72c4ba3c-6c9f-4dab-b49a-b9b318347475",
"metadata": {},
"source": [
"It seems to be working well, as it has predicted the categories we expected. We can look in more detail at the model probabilities predicted by using `predict_proba` instead of `.predict`:"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "72f00094-d275-4681-9fee-0b24a8ace99b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
amazon
\n",
"
rottentomatoes
\n",
"
yelp
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
0.111143
\n",
"
0.028364
\n",
"
0.860493
\n",
"
\n",
"
\n",
"
1
\n",
"
0.000286
\n",
"
0.999051
\n",
"
0.000663
\n",
"
\n",
"
\n",
"
2
\n",
"
0.878006
\n",
"
0.121334
\n",
"
0.000660
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" amazon rottentomatoes yelp\n",
"0 0.111143 0.028364 0.860493\n",
"1 0.000286 0.999051 0.000663\n",
"2 0.878006 0.121334 0.000660"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(logreg.predict_proba(new_dtm), columns=le.classes_)"
]
},
{
"cell_type": "markdown",
"id": "3ad791fe-2ec1-4fc2-9aba-01cb752d018c",
"metadata": {},
"source": [
"We can see that the model is extremely confident (~99.9%) that the second review is a movie review, but less confident in the predictions of reviews 1 & 3. Likely there is some ambiguity in review #3, as words like 'film' and 'screen' would also appear in movie reviews."
]
},
{
"cell_type": "markdown",
"id": "9530fc04-ceb4-43ba-b5cb-6e3f32b54beb",
"metadata": {},
"source": [
"## Model introspection\n",
"\n",
"Finally, we can perform model introspection to look at what the model has learned. For Logistic Regression, this means looking at the relative sizes and signs of the coefficients as they related to the different tokens which are predictive of each class.\n",
"\n",
"Since we had 3 classes and 33,715 features, the coffiecients array stored in `.coef_` in the fitted Logistic Regression model has these dimensions. We can nicely put this into a pandas dataframe using the labels from the CountVectorizer and LabelEncoder to make clear:"
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "f21150d7-7648-45ad-a031-b4efecd3f219",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
amazon
\n",
"
rottentomatoes
\n",
"
yelp
\n",
"
\n",
" \n",
" \n",
"
\n",
"
00
\n",
"
-0.000958
\n",
"
-0.000332
\n",
"
0.001290
\n",
"
\n",
"
\n",
"
007
\n",
"
-0.010189
\n",
"
0.023619
\n",
"
-0.013430
\n",
"
\n",
"
\n",
"
01
\n",
"
0.000004
\n",
"
0.000005
\n",
"
-0.000009
\n",
"
\n",
"
\n",
"
01042012
\n",
"
-0.000501
\n",
"
-0.000036
\n",
"
0.000537
\n",
"
\n",
"
\n",
"
03342
\n",
"
0.000005
\n",
"
0.000005
\n",
"
-0.000010
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
ém
\n",
"
0.000001
\n",
"
0.000004
\n",
"
-0.000005
\n",
"
\n",
"
\n",
"
ótimo
\n",
"
-0.009188
\n",
"
0.013575
\n",
"
-0.004387
\n",
"
\n",
"
\n",
"
ôºå
\n",
"
0.000000
\n",
"
0.000000
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
única
\n",
"
-0.016219
\n",
"
0.023045
\n",
"
-0.006827
\n",
"
\n",
"
\n",
"
único
\n",
"
-0.007613
\n",
"
0.018862
\n",
"
-0.011249
\n",
"
\n",
" \n",
"
\n",
"
33715 rows × 3 columns
\n",
"
"
],
"text/plain": [
" amazon rottentomatoes yelp\n",
"00 -0.000958 -0.000332 0.001290\n",
"007 -0.010189 0.023619 -0.013430\n",
"01 0.000004 0.000005 -0.000009\n",
"01042012 -0.000501 -0.000036 0.000537\n",
"03342 0.000005 0.000005 -0.000010\n",
"... ... ... ...\n",
"ém 0.000001 0.000004 -0.000005\n",
"ótimo -0.009188 0.013575 -0.004387\n",
"ôºå 0.000000 0.000000 0.000000\n",
"única -0.016219 0.023045 -0.006827\n",
"único -0.007613 0.018862 -0.011249\n",
"\n",
"[33715 rows x 3 columns]"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coef_df = pd.DataFrame(logreg.coef_.T, columns=le.classes_, index=cv.get_feature_names_out())\n",
"coef_df"
]
},
{
"cell_type": "markdown",
"id": "f55f5827-13a8-47b8-ad53-393b747fbad6",
"metadata": {},
"source": [
"We can then visualize this with some matplotlib code to see which tokens are most predictive of each class:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "514b4daa-40cc-49b1-871e-d7370b1512c7",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.subplots(1, 3, figsize=(10, 5))\n",
"\n",
"plt.suptitle('Most Predictive Tokens by Class - Positive')\n",
"\n",
"plt.subplot(1,3,1)\n",
"coef_df['amazon'].nlargest(10).sort_values(ascending=True).plot(kind='barh', color='green')\n",
"plt.title('Amazon')\n",
"\n",
"plt.subplot(1,3,2)\n",
"coef_df['rottentomatoes'].nlargest(10).sort_values(ascending=True).plot(kind='barh', color='green')\n",
"plt.title('Rotten Tomatoes')\n",
"\n",
"plt.subplot(1,3,3)\n",
"coef_df['yelp'].nlargest(10).sort_values(ascending=True).plot(kind='barh', color='green')\n",
"plt.title('Yelp')\n",
"\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"id": "6e90d400-547c-4871-87bd-880f406a9c46",
"metadata": {},
"source": [
"As we might expected, words like 'alexa', 'tablet', and 'kindle' are highly predictive of the Amazon class, whereas words like 'food', 'place' and 'delicious' are of the Yelp class, and \"movie-like\" words of the RottenTomatoes class. This makes sense, and also reflects the qualitities of the training data.\n",
"\n",
"Are there any issues with our model, however? What if we look at the most negatively predictive features (tokens) by class?"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "c590b976-9d2f-41d1-bad0-3a3af962c454",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.subplots(1, 3, figsize=(10, 5))\n",
"\n",
"plt.suptitle('Most Predictive Tokens by Class - Negative')\n",
"\n",
"plt.subplot(1,3,1)\n",
"coef_df['amazon'].nsmallest(10).sort_values(ascending=True).plot(kind='barh', color='red')\n",
"plt.title('Amazon')\n",
"\n",
"plt.subplot(1,3,2)\n",
"coef_df['rottentomatoes'].nsmallest(10).sort_values(ascending=True).plot(kind='barh', color='red')\n",
"plt.title('Rotten Tomatoes')\n",
"\n",
"plt.subplot(1,3,3)\n",
"coef_df['yelp'].nsmallest(10).sort_values(ascending=True).plot(kind='barh', color='red')\n",
"plt.title('Yelp')\n",
"\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"id": "530d03ce-a237-4145-a603-2426d14a8d4c",
"metadata": {},
"source": [
"We can see that the words which are positively predictive for one class are generally in the list of most negatively predictively for the other classes (as each review must belong to one of the three classes). This reflects the nature (bias) of our training data set. From this learning in the MVP, it might make more sense to build a system with individual models for predicting each class if the number of classes is low, or to try using a dataset with a larger number of categories for tagging the site reviews with and see if these issues are addressed in the model training."
]
},
{
"cell_type": "markdown",
"id": "873cbf62-25b2-4ef3-9a97-45c3aa109b93",
"metadata": {
"id": "uOsXgUYiwmCW"
},
"source": [
"